
Service Level Agreements for Carrier-Grade Clouds: beyond best-effort

Service Level Agreements (SLAs) are the nuts and bolts of a business agreement and, next to QoS and Security, the final pillar of a Carrier-Grade Cloud offering. SLAs define the functional and non-functional conditions under which a service must be delivered, and allow penalties or compensations to be derived directly from them.

Ironically, these SLAs have historically been best-effort, static (sometimes paper) constructs that take only the top-level interface into consideration. Service-oriented infrastructures, however, are built on complex n-tier architectures that usually reside (at least in part) under the control of external providers.

In practice there is very little a consumer of such a service can do to prove a violation to the provider.

This lack of transparency and SLA monitoring is the biggest hurdle for risk-management departments that need to decide whether it is safe to move a critical part of their business to the cloud, and it is amplified even further on Carrier-Grade Cloud platforms.

Best-effort SLAs are a relic of how services and software were sold in the '90s and have no place in a modern XaaS delivery.

Currently, service providers cannot plan their service landscapes using the SLAs of dependent services. They have no means by which to determine why a certain SLA violation might have occurred, or how to express an associated penalty.

SLA terms are not explicitly related to measurable metrics, nor is their relation to lower-level services clearly defined. As a consequence, service providers cannot determine the necessary (lower-level) monitoring required to ensure top-level SLAs. This missing relationship between top-level SLAs and (lower-level) metrics is a major hurdle to efficient service planning and prediction in service stacks.
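
As a concrete (and purely illustrative) sketch of what such a relationship could look like, here is a minimal Python representation of an SLA term that is explicitly tied to lower-level metrics. The class names, metrics and penalty wording are assumptions made for the example, not any particular standard:

    from dataclasses import dataclass, field

    @dataclass
    class MetricRef:
        """A lower-level, measurable metric that an SLA term depends on."""
        name: str      # e.g. "net.p99_latency_ms"
        source: str    # the layer that produces it: "IaaS", "PaaS", "network", ...

    @dataclass
    class SlaTerm:
        """A top-level SLA term made machine-readable by linking it to metrics."""
        name: str
        objective: str
        depends_on: list[MetricRef] = field(default_factory=list)
        penalty: str = ""

    # An availability term whose violations can be traced to the metrics below,
    # telling the provider exactly which lower-level monitoring it needs.
    availability = SlaTerm(
        name="api_availability",
        objective=">= 99.95% per calendar month",
        depends_on=[
            MetricRef("lb.healthy_backends", source="IaaS"),
            MetricRef("net.p99_latency_ms", source="network"),
        ],
        penalty="5% service credit per 0.1% shortfall",
    )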

The challenge and opportunity for cloud computing businesses is to move beyond best-effort SLAs and offer services with:

  1. adjustable, machine-readable SLAs
  2. the ability to negotiate tailor-made agreements in an automated fashion
  3. consistent SLA management across all layers of the technology stack, including the external (Tier-2) stakeholders involved in the delivery
  4. coverage of all domains, including Security, QoS and Availability
  5. integration with Billing, Customer-Experience Management (CEM) and tools for measuring Quality of Experience (QoE)
  6. transparent monitoring of SLA violations and more intelligent billing
  7. awareness of self-healing, self-organizing and self-configuring systems
  8. integration with systems that affect provisioning, such as configuration management and life-cycle management systems

Further Reading:

  1. The SLA@SOI EU project, which includes links to Open Source software for SLA automation
  2. Cloud Computing SLAs: http://ec.europa.eu/information_society/newsroom/cf/dae/document.cfm?doc_id=2496
  3. Service Level Agreement Complexity: Processing Concerns for Standalone and Aggregate SLAs: http://arxiv.org/abs/1407.7257
  4. SLA-Enabled Infrastructure Management (with Apache Tashi): http://pd.zhaw.ch/publikation/upload/206042.pdf
  5. PSLA: A PaaS Level SLA Description Language: http://dl.acm.org/citation.cfm?id=2624614
Joachim Bauernberger

Passionate about Open Source, GNU/Linux and Security since 1996. I write about future technology and how to make R&D faster. Expatriate, Entrepreneur, Adventurer and Foodie, currently living near Nice, France.

QoS considerations for Carrier Grade Clouds: location, location, location.

A Carrier-Grade Cloud (CGC) is a cloud platform suitable for deployments with stringent Availability, Reliability, QoS and Security requirements, which are fundamental for real-time applications in domains such as Telecoms, Automotive and Banking (Trading).

In our previous post we looked at the security requirements paramount to the design of such systems. Now let's see how Network Equipment Providers (NEPs) and mobile operators can differentiate their offering and meet a CGC's stringent Quality of Service (QoS) demands.

The challenge with QoS in the cloud:

QoS is the ability to give different priority to different applications, users or data flows, or to guarantee a certain level of performance. The technical problems behind guaranteeing QoS are non-trivial for all but the most basic "best-effort" guarantees; resource capacity planning alone is NP-hard.

QoS requirements cannot be met with SLAs and contractual enforcement alone; they must be "hard" requirements within the technical design of such systems. These QoS constraints should be more than best-effort, formally drafted into an SLA using templates such as the Web Services Agreement (WS-Agreement), and made discoverable by the applications.

Network traffic represents the bulk of an operator's cost, so maintaining a cloud infrastructure in one central location would be unsustainable. The way forward is a distributed, location-aware cloud infrastructure. Here operators are best positioned, since they own the pipes and datacenters and can rely on their physically distributed network infrastructure to minimize routing costs and latency.

Traditional cloud computing is location-agnostic. But to meet QoS demands in a CGC a balance must be found between:

  1. reducing the cost of computation and storage (which comes at the expense of the network), and
  2. reducing the cost of networking (which increases requirements for local storage and processing)

The key differentiator from other off-the-shelf services, as far as QoS is concerned, is to provide ways of managing locality and fine-grained control over where data is placed. Data should be stored where it is used, coupled with intelligent caching and replication to meet the application's network-performance requirements. Fetching data over the network should be avoided; instead, computation should take place where the data is, by making compute power available at strategic locations.
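
As a toy illustration of the "move computation to the data" idea, the sketch below picks the site where running a job pulls the fewest bytes over the network. The site names, data sizes and cost figure are hypothetical; this is not a real scheduler:

    # Illustrative only: choose where to run a job so that the least input data
    # has to be fetched over the network.
    def pick_site(data_location_gb: dict[str, float],
                  cost_per_gb_transfer: float = 0.02) -> str:
        """Return the site whose local data minimizes network transfer cost."""
        total = sum(data_location_gb.values())
        # If the job runs at `site`, everything stored elsewhere must be transferred.
        return min(data_location_gb,
                   key=lambda site: (total - data_location_gb[site]) * cost_per_gb_transfer)

    # 120 GB of the input sits in Lyon, 30 GB in Paris, 5 GB in Nice:
    print(pick_site({"lyon": 120.0, "paris": 30.0, "nice": 5.0}))  # -> "lyon"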

CGC orchestration must therefore take a holistic approach, not only considering the management of storage and computation, but also the network.

That said, cloud security requirements prevent us from exposing the network-management or location to the upper application layer. Internal routing or priority of scheduling must be kept encapsulated within the PaaS/IaaS to avoid exposure of the internal topology to applications.

QoS parameter encapsulation:

For QoS we need an abstraction offered to the upper application layers: a policy of constraints which allows the application (the user of the service) to specify which type of priority it should receive (see the sketch after this list), including:

  • VM bandwidth, latency, redundancy
  • Co-location proximity of VMs (how close a fail-over node may be placed)
  • Location of data, possibly with an intelligent mechanism to predict where data will be needed based on historical usage
  • A management interface to model future usage, costs and what-if scenarios
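
A minimal sketch of such a constraint policy is shown below. Every class and field name is an assumption made for illustration (a real PaaS/IaaS would expose its own schema); the point is only that the application states what it needs without ever seeing the underlying topology:

    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional

    class Proximity(Enum):
        SAME_RACK = "same_rack"            # lowest latency, weakest failure isolation
        SAME_DATACENTER = "same_datacenter"
        SAME_REGION = "same_region"        # strongest isolation of the three

    @dataclass
    class QosPolicy:
        """Constraints handed to the PaaS/IaaS; the internal topology stays hidden."""
        min_bandwidth_mbps: int
        max_latency_ms: float
        redundancy: int                    # number of fail-over replicas
        failover_proximity: Proximity      # how close a fail-over node may be placed
        data_locality_hint: Optional[str] = None  # e.g. "EU-FR", possibly predicted from usage history

    policy = QosPolicy(
        min_bandwidth_mbps=500,
        max_latency_ms=10.0,
        redundancy=2,
        failover_proximity=Proximity.SAME_REGION,
        data_locality_hint="EU-FR",
    )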

This is currently more of a research subject and also new territory for the cloud vendors. The good news is that Telcos have a head-start in this area because:

  1. QoS and real-time have always been their core business, from the early days of circuit-switched routing to today's packet-switched networks.
  2. Telcos are big enough to build their own internal offering instead of having to rely on reselling leased cloud services. This gives them better control and allows them to guarantee QoS levels and open up new revenue streams.
  3. Telcos control the network pipes and routing policy, hence are best positioned to build additional features and services and open the cloud to areas which previously had no use for it.

Conclusion:

Telcos can leverage their experience in billing metered services and further differentiate by billing based on the QoS constraints and resources actually used.

Network, location and QoS are major concerns for a CGC. Standards such as SDN/NFV will bring network management to the cloud and enable the required fine-grained control over where data is processed, cached and stored within a Carrier-Grade Cloud.

Further work is needed from standardization groups defining capacity-planning and resource scheduling in the cloud.

Further Reading:

  1. A QoS aware service mashup model for cloud computing applications [pdf]
  2. Managing Performance Interference Effects for QoS-Aware Clouds [pdf]
  3. NEC’s Carrier Grade Cloud Platform [pdf] (focus mainly on NFV)

 


Software-Defined Networking: A Comprehensive Survey

This is probably one of the most complete papers on the subject I have seen to date, published 02/06/2014 by Diego Kreutz, Fernando M. V. Ramos, Paulo Verissimo, Christian Esteve Rothenberg, Siamak Azodolmolky and Steve Uhlig.

So if you’re new to SDN or need an in-depth look, grab yourself a fresh coffee and get started:

Software-Defined Networking (SDN) is an emerging paradigm that promises to change the state of affairs of current networks, by breaking vertical integration, separating the network’s control logic from the underlying routers and switches, promoting (logical) centralization of network control, and introducing the ability to program the network. The separation of concerns introduced between the definition of network policies, their implementation in switching hardware, and the forwarding of traffic, is key to the desired flexibility: by breaking the network control problem into tractable pieces, SDN makes it easier to create and introduce new abstractions in networking, simplifying network management and facilitating network evolution.

Today, SDN is both a hot research topic and a concept gaining wide acceptance in industry, which justifies the comprehensive survey presented in this paper. This paper starts by introducing the motivation for SDN, explains its main concepts and how it differs from traditional networking. Next, it presents the key building blocks of an SDN infrastructure using a bottom-up, layered approach. It provides an in-depth analysis of the hardware infrastructure, southbound and northbounds APIs, network virtualization layers, network operating systems, network programming languages, and management applications. It also looks at cross-layer problems such as debugging and troubleshooting.

In an effort to anticipate the future evolution of this new paradigm, the main ongoing research efforts and challenges of SDN are discussed. In particular, the design of switches and control platforms — with a focus on aspects such as resiliency, scalability, performance, security and dependability — as well as new opportunities for carrier transport networks and cloud providers. Last but not least, it analyzes the position of SDN as a key enabler of a software-defined environment.



Considerations for Carrier Grade Clouds: Security

A carrier-grade cloud is a cloud platform or infrastructure suitable for deployments with stringent availability, reliability, QoS and security requirements, which are fundamental for real-time applications in the Telecoms, Automotive, Banking (Stock Trading) and Energy (Smart-Grid) sectors.

These industries operate under regulations that can prevent data sharing in many circumstances. Additionally, companies are reluctant to share market-sensitive data that could give away an economic advantage. Data traversing the cloud must be secured, meeting all compliance requirements, so that it cannot be used by unintended parties. A carrier-grade cloud must provide adequate data sanitization and filtering capabilities to protect application or sensor data.

While such requirements are traditionally addressed with SLAs, contractual enforcement alone is not enough. A carrier-grade XaaS provider must invest heavily in technology to meet security and privacy requirements if its offering is to be accepted by customers, and must provide built-in mechanisms to enforce those requirements and satisfy industry regulation.

Identity and privacy management in location-aware systems, as well as communication and data security, are some of the most sensitive areas with respect to cloud security. Generally, these risks (a.k.a. the Notorious Nine) affect all cloud computing solutions and must be mitigated:

1. Data Breaches
2. Data Loss
3. Account Hijacking
4. Insecure APIs
5. Denial of Service
6. Malicious Insiders
7. Abuse of Cloud Services
8. Insufficient Due Diligence
9. Shared Technology Issues

Several additional aspects require special consideration in carrier grade environments:

  • In telecommunications, the handling and storage of customer data is subject to legal requirements that vary from country to country (e.g. what data must be stored and for how long, what may not be stored, and where the storage takes place). These requirements interact with the fact that telco operators implementing clouds must design for reduced network latency/traffic and for proximity, and hence treat compute and network assets as a single pool of resources.
  • SLAs alone are not sufficient to maintain strict identity and trust management. Large organizations rely on the security of their communication infrastructure and must never be exposed to rogue employees of the cloud infrastructure provider.
  • Traceability requirements imposed by regulatory bodies must undergo additional analysis to reduce the risk of abuse.

Rather than enforcing them solely through SLAs, a carrier-grade IaaS/PaaS must provide mechanisms to manage the legal constraints on data accessibility, readability and localization as part of the system itself.

Identity and communication data must be as safe as (or safer than) when managed in-house, and isolated from other users of the infrastructure. While isolation is already a core aspect of virtualization, a carrier-grade implementation must guarantee that it is preserved. Carrier-grade architectures must therefore guarantee that data at rest, and snapshots of a VM image, can never be decrypted without the involvement of the customer.

Compartmentalization, limiting computation to encrypted datasets (or their metadata), and moving computation on decrypted content to a private cloud is one option, but it currently lacks proper standardization. At the transport level, packets must be secured and tamper-resistant, preferably with control over routing policies offered as a value-added service, so that customers can opt in or out of having their data traverse jurisdictions with weak privacy laws or a reputation for open surveillance.
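
One way to approach the "never decryptable without the customer" property is envelope encryption with a customer-held key. The sketch below assumes the Python cryptography package and shows only the principle, not any provider's actual key-management service:

    from cryptography.fernet import Fernet

    # Held by the customer; generated here only to keep the example self-contained.
    customer_key = Fernet.generate_key()

    def encrypt_snapshot(snapshot: bytes) -> tuple[bytes, bytes]:
        data_key = Fernet.generate_key()                      # per-snapshot data key
        ciphertext = Fernet(data_key).encrypt(snapshot)       # encrypt the snapshot
        wrapped_key = Fernet(customer_key).encrypt(data_key)  # wrap the data key with the customer's key
        return ciphertext, wrapped_key                        # provider stores both, can decrypt neither

    def decrypt_snapshot(ciphertext: bytes, wrapped_key: bytes) -> bytes:
        data_key = Fernet(customer_key).decrypt(wrapped_key)  # impossible without the customer's key
        return Fernet(data_key).decrypt(ciphertext)

    ciphertext, wrapped = encrypt_snapshot(b"VM image contents ...")
    assert decrypt_snapshot(ciphertext, wrapped) == b"VM image contents ..."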

Carrier-grade cloud services must fulfill independent certification requirements to guarantee that providers can restore operations and data. To provide a coherent security framework suitable for carrier clouds, we need a schema that defines the security requirements for each layer (networking, hardware, hypervisor, virtual machines, OS and middleware) and the interactions between them.

To fill these gaps in existing standardization and architecture requirements, additional policies are needed. But instead of reinventing the wheel, carrier-grade clouds should build on existing security standards targeted at cloud computing, such as ISO 27017, PCI-DSS, NIST and CSA guidance, as well as auditing standards (ISO 27001), which until now have been largely missing for cloud computing. Such a framework/schema should also consider telco-specific requirements on data storage (location) and compliance with local legal requirements (on privacy and retention), or consolidate the framework to adhere to the rules of the strictest jurisdiction.

Further reading:

GridCloud: http://www.cs.cornell.edu/Projects/gridcontrol/index.html#gridcloud
Isis2: https://isis2.codeplex.com/
ENISA: http://www.enisa.europa.eu/act/rm/files/deliverables/cloud-computing-risk-assessment/at_download/fullReport
MOFO Changes to SAS 70: http://media.mofo.com/files/Uploads/Images/101227-Changes-to-SAS.pdf


DevOps in Telecoms

A recent article from CA takes a look at how DevOps is making strides in many industries, improving the speed at which we deploy solutions in Agile development environments. What surprised me was that in Telecoms the adoption of DevOps is much higher than generally believed, and the industry seems most willing to extend its existing Agile methodologies. CA says:

“The clear leader here is telecoms. They are almost two-times more likely to be using DevOps (68%) than the aggregate (39%). In fact, a total of 88% of telecoms either are using it or plan to, versus an aggregate of 66%. It is clear that the telecoms industry is one of the most competitive, and the pressures to continuously deliver new products and services are enormous. The benefits of DevOps in terms of accelerating the pace of delivery provides a pretty compelling reason for telecoms to move forward aggressively with DevOps.”

Different challenge for vendors and operators

This seems accurate for ISPs and mobile operators, who deployed Agile methods years ago and are now moving to WebRTC, web services and HTTP/REST in their battle to maintain profits in Value-Added Services (VAS). Their playing field is becoming more like IT, with services running in a browser or easily virtualized on OpenStack. DevOps comes from IT (as did Agile), so it is logical for industries that converge with Internet platforms and web portals to adopt both the DevOps tools and the mindset.


But this pure IT stack, found at nearly every operator and ISP, is only a subset of "telecoms" and doesn't reflect the largest share of what the Telecoms industry actually is: namely, the network equipment providers building the equipment and infrastructure.

Apart from the four biggest vendors (NSN, Ericsson, Alcatel-Lucent and Huawei), there are thousands of medium-sized firms writing and deploying software using Agile principles. They moved from iterative waterfall to Agile years ago, and despite their huge size (compared to IT teams) they are no strangers to Continuous Integration and, up to a point, even Continuous Deployment.

But unlike IT and Internet platforms, they don't create a virtual service to be deployed somewhere in the cloud, nor can their products be "continuously" patched in an Agile manner. They deliver hardware that may cost millions to commission and is maintained over years with strict SLAs. So on a purely technical level, using OpenStack, Puppet, Chef, Salt or other tools, DevOps isn't going to do much for the Telco guys.

When I first asked my former colleagues from my time working in SaaS in 2012 what DevOps actually was, the confusing answer by advocates was:

“To understand DevOps you need to remember that it isn’t a framework of tools, but a mindset”.

This made me curious, because nobody even in IT could provide a proper definition other than: “It will make you faster by improving the communication between development and integration/deployment departments”.

Be it Telecoms, Automotive or any other hardware-intensive industry, anyone forced to take a conservative route is unlikely to embrace DevOps by the book, even if Agile processes are already in place.

Those of you familiar with some of the funny images over at devops reactions, might get a good chuckle out of the idea of “DevOps in a Nuclear Power Plant“, or deploying “live upgrades to Medical Surgery Equipment“.

Amusing as this is, it's the same in Telecoms: a mobile base station, eNodeB or any other node delivering a service to thousands of subscribers has to work without faults. If your shiny new LTE/4G phone constantly drops voice calls – a service which has been working since 1948 (0G/MTS) – or emergency services like 911 are disrupted, the penalties to the vendor are severe (millions per hour).

Sending an engineer to the site to collect logs takes a day. A few more days to analyze, then deliver and roll out a correction which may or may not work. By the time the issue is solved, penalties are often in excess of the revenue earned from the deal. So finding faults as early as possible, using techniques like Design Failure Mode and Effects Analysis (DFMEA), is the most important factor in whether you remain profitable.

But don’t dismiss DevOps outside the Cloud and in conservative industries just yet. The DevOps “mindset” can still make you quicker than your competition! Before we go all philosophical, let’s look at how these bigger firms outside IT usually develop/deploy their products.

Use Case: ACME Corp

The example below uses ACME, a telecom equipment vendor, to illustrate a typical scenario, but it can easily be applied to any other company which builds complex and very expensive systems, where the product isn't just SaaS but is delivered with thousands of moving parts. ACME could equally be an OEM automotive manufacturer, an aerospace company, a power plant delivering smart-grid solutions or a firm offering industrial automation.

ACME is a multi-billion dollar business that develops network components for mobile operators. One of their R&D projects currently keeps more than 1000 engineers busy developing away to make this world a better place. In the last 10 years ACME successfully moved from "iterative" (Waterfall) development methods to Agile. Their engineers now organize themselves as cross-functional micro-teams which are assembled and disbanded in an ad-hoc manner, working on various chunks of the product. Not everyone touches everything in the code repository, but they look at code and ownership in the context of features and their impact on the system.

Competences are spread across different geographic sites with a good mix and balance of skills. Some teams are closer to the hardware, while others work higher up the technology stack. Strict quality guidelines ensure that whenever a change is submitted to the software repository, the continuous integration (CI) system gives immediate automated feedback on whether the developers broke existing functionality. In this first instance the CI executes mainly unit tests, but also more elaborate System-Component Tests which verify the final binary against more complex scenarios and even simulate the messages the binary would later handle on the real hardware. When the CI raises faults, the developers either deliver a correction immediately or roll back the change. This way there is always a very recent, working and testable version of the product or its sub-components.

Once changes or new features are delivered, and provided nothing breaks, the code is automatically "promoted" and released to downstream integration and test stages.

The teams in downstream departments then (cherry-)pick a recent version from the upstream CI, but also align and coordinate deliveries from other departments which have their own CI and contribute to the final product. They integrate these versions further on the real hardware platform, which also differs depending on the application (MCU, DSP, etc.). For example, some departments deliver the kernel and OS abstraction layer, others deliver Layer 1 (MAC and Physical Layer) or Layer 2 (Forwarding Plane) of the OSI model, and then there are the teams providing the actual functionality to handle messages on radio or core network interfaces: User Plane, Control Plane, OAM. Each downstream team has its own CI system and repository to version and store its test cases, and in turn "promotes" and releases whatever came from upstream, after its own tests have passed, to the next teams. This too is done in a mostly automated manner, and human intervention is only required to analyze why the pipeline blocks up.
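
The "promote on green" behaviour described above can be sketched in a few lines. The stage names and make targets below are hypothetical placeholders, not ACME's actual tooling:

    import subprocess

    STAGES = [
        ("unit tests", ["make", "test-unit"]),
        ("system-component tests", ["make", "test-sct"]),  # runs against simulated messages
    ]

    def promote_if_green(change_id: str) -> bool:
        """Run each gate in order; promote the change downstream only if all pass."""
        for name, cmd in STAGES:
            if subprocess.run(cmd).returncode != 0:
                print(f"{change_id}: {name} failed -> deliver a correction or roll back")
                return False
        print(f"{change_id}: all gates green -> promoted to downstream integration")
        return True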

You might find four or more such test/integration stages, each organized under its own management layers. Eventually the initial code, now a final binary package, reaches the real hardware, where end-to-end system tests and inter-operability (IOT) tests can be conducted before moving on to field tests and, eventually, customer deployments.

Everyone is happy! That is, until faults come cascading back up from downstream and analysis eats up a lot of the already-overloaded developers' time.

  • Which software package version has the fault?

  • How many branches have we already created in the meantime which now have the same faults?

  • Where do we have to merge the corrections to?

  • Who else will come back from another department reporting the same error?

  • But there is already a correction in version control so why wasn’t this one used downstream?

  • How can we ask developers to deal with all this overhead when they are meant to work on a tight deadline to churn out the next features in the sprint?

Who will deal with these questions if we're unable to route every fault directly to the developer?


Fault Manager to the rescue!

But wait! Now we have an important customer trial (actually we always have one of those), which means we have many of these must-fix-immediately-super-critical-do-not-wait faults. And we need developers to correct them right now. Yes, NOW … even if we don't yet know who the culprit is, who committed this hideous crime. And since we fault managers are also overloaded, we don't have time to analyze any of this.

If only we had some dedicated people who could look at the logs and identify which developers must deliver the correction!

Pre-Analysis to the rescue!

And just in the nick of time, our hero managed to analyze the 10 GB of logs, identified the usual suspects, confirmed it wasn't a bug in the tests themselves, and forwarded the fault to the right team, who then delivered the required correction tout de suite.

This is clearly not the end of the story, though, since by now the fault has propagated into products which have already been frozen and therefore MUST NOT be corrected unless there is an official request from the customer.

Given that the official delivery of the faulty code was made three weeks ago, this issue will rear its ugly head a few more times in the months to come: each time on a different branch, popping up as a new bug in the bug-reporting tool with a different set of trace files.


So clearly our heroes will be busier than Batman and Robin for years to come!

What happened? Didn't Agile make ACME faster?

Having abolished individual code ownership and implemented Agile internally within their teams, ACME Corp gained tremendous speed over their competitors, churning out well-tested changes and new features every few hours and delivering several times a day instead of once a week. "Fail quickly" is the new mantra.

But by doing so another bottleneck surfaced: the company had internally committed to Agile methods, but every department still worked on their own terms, maintaining their status quo and continued to treat other departments as external collaborators.

Back to their workflow: if you look closely at how the product delivery chain "cascades", often over five or more integration and test departments, you still see a waterfall! And that waterfall is one costly, resource-hungry administrative nightmare, quickly eating away the developers' time, as they constantly have to provide clarification to dedicated coordinators whose main job consists of steering communication about faults into the correct channels. Frustrating.

And this is where Agile stops and DevOps can help you pierce those remaining silos.

What DevOps can bring to the table in industries that are “close to the metal”, isn’t a new tool to solve all your problems. And neither was Agile!

Thinking in “DevOps terms” means integrating your downstream test environments / operations and bringing them closer to developers.

In fact, from the downstream departments' perspective not much has changed, mainly because integration for them is technically not as easy: there is little virtualization, and they operate closer to the hardware (using slow and unreliable JTAG/USB interfaces, debugging communications on SRIO or ePCI, or dealing with highly specialized Telco interfaces such as OBSAI/CPRI), often requiring manual reboots of test PCs with purpose-built drivers. So they still get a baseline at regular intervals, which is manually loaded into the test environment, and then they provide feedback upstream. To make it clearer: the teams conducting integration have a totally different skillset, far from what a developer does. Here the focus is on understanding the message flow of the 3GPP specs end-to-end rather than churning out individual parts in C/C++. When Integration speaks about a "feature", it usually touches several (or all) of the involved teams.

The further you move downstream, the more you'll find a lack of the scripting know-how that comes naturally in the upstream layers. Integration might select only a few of these baselines at random, or "cherry-pick" whatever looks most promising.

So even if our developers in the different lines churn out features at maximum capacity, “downstream” continues to live isolated from the rest of the upstream departments.

And it gets worse the further you move downstream away from developers:

  • Each department invents its own test systems because of its unique needs (some justified, some not).

  • Many write overlapping test cases and re-test what was already verified. Not that double-checking is a bad thing, but writing and maintaining the same test code several times is a waste. There is massive overlap, with engineers creating and solving similar problems in every department.

What practical steps are there to move towards DevOps?

A DevOps "mindset" will help you cut down the technical barriers between these cascading departments, so that automated deployments become possible. Add APIs that make the whole delivery chain transparent from a technical point of view – all the way to the target hardware.

The good news is that you don't have to start doing it all at once, as you did when initially migrating to Agile 10 years ago! Instead, start with a gap analysis conducted by an internal architect – let's call this person a consolidation engineer – and check every one of your test/integration departments to identify overlap and ways to automate your deployment APIs/interfaces.

Once you have a clear picture, figure out how you can break your internal silos and improve inter-department communication. How would you do that? Simple: you already did this once when you introduced Agile in your development teams! (And yes, back then some people probably resisted – some even left – and it won't be easier this time either.)

  1. Now take the next step and have some of your developers spend one or two days a week in what was until now the next downstream department, and move some of the downstream people into development for a few days a week.

  2. You might want to think about incentives for doing so and reward people willing to work across these borders. Consider them ambassadors in your company who pierce through your silos. Before you know it, boundaries will have become blurry and that silo will be cracking! Natural cross-pollination of ideas will start happening – from the bottom up!

  3. Developers now see the effects of their code changes within just a few hours, and on the actual target hardware. This makes them a more relevant part of your big "organization-machine" and also helps them identify more deeply with the product you sell.

  4. Trust your test coverage! If something fails downstream, you must have communication between these departments to figure out why your test coverage couldn't detect the problem earlier. Any error which made it downstream must be covered by future tests on all branches. Limiting the number of active branches reduces "context switches" for developers and the complexity of your system. If your test coverage is so bad that your only way of ensuring quality is to isolate and freeze a branch, then improve here first. The decision to create a new branch usually comes top-down (from QA managers), but it ignores the fact that this is the most costly option and it never scales.

Will this make the "Fault Manager" and "Pre-Analysis" roles obsolete?

I'd picture these positions like an "Interim Manager" who comes in during times of change and uses special skills to bridge the worlds of R&D and Operations. Big firms doing Agile without DevOps won't be able to deliver services without them. But once your silos crack, these tasks need to be redefined.

How long will it take to get there?

You need support from the bottom up and the top down, because breaking silos and doing DevOps affects all links in the delivery chain. From my personal experience during the migration from Waterfall to Agile, and from the many technical interviews I conducted as a Telco recruiter, the time it takes to move from waterfall to Agile is 1-3 years. I'd predict another 1-3 years for DevOps. The biggest challenge in large organizations isn't technology but politics and people, or as Gerald M. Weinberg once famously put it:

"No matter how technical it looks at first, it's always a people problem."

Conclusion:

  • Doing DevOps in industries outside XaaS and the Web is not all that different; we just deal with more stakeholders and with more complex, conservative structures.

  • It will take longer to get everyone on board. But if you zoom out and focus on the interfaces (e.g. how these silos communicate), then "a silo is a silo" and there is no difference from IT. It's just a larger scale.

  • These firms already took the ideas of Agile from IT and scaled them to their needs – mostly successfully. Individual silos already operate at maximum speed and efficiency, and little can be improved internally (most of them operate like pressure cookers). DevOps is the missing link and the logical next step, as it will reduce the growing friction between these silos.

  • Once you move to DevOps, it will increase cross-functional skills and lead to a better understanding of the problems other stakeholders are facing.

  • Finally, it enables solutions such as putting more trust in your test suite instead of maintaining expensive QA branches, or regularly rotating people between upstream and downstream – solutions which have always felt right but were until now politically impossible to implement.

Please share your thoughts below.
