Process and Organization

To make a change, your company has to go through a decision cycle, as illustrated in the figure that follows. Someone must sense that a need exists. Someone must decide that a feature will fit that need and that it’s worth doing...and how quickly it’s worth doing. And then someone must act, building the feature and putting it to market. Finally, someone must see whether the change had the expected effect, and then the process starts over. In a small company, this decision loop might involve just one or two people. Communication can be pretty fast, often just the time it takes for neurons to fire across the corpus callosum. In a larger company, those responsibilities get diffused and separated. Sometimes an entire committee fills the role of “observer,” “decider,” or “doer.”

images/adaptation/deming_pdca_cycle.png

The time it takes to go all the way around this cycle, from observation to action, is the key constraint on your company’s ability to absorb or create change. You may formalize it as a Deming/Shewhart cycle,[84] as illustrated in the previous figure; or an OODA (observe, orient, decide, act) loop,[85] as shown in the figure that follows; or you might define a series of market experiments and A/B tests. No matter how you do it, getting around the cycle faster makes you more competitive.

 

images/adaptation/boyd_ooda_loop.png

 

This need for competitive maneuverability drives the “fail fast” motto for startups. (Though it might be better to describe it as “learn fast” or simply “adapt.”) It spurs large companies to create innovation labs and incubators.

Speed up your decision loop and you can react faster. But just reacting isn’t the goal! Keep accelerating and you’ll soon be able to run your decision loop faster than your competitors. That’s when you force them to react to you. That’s when you’ve gotten “inside their decision loop.”

Agile and lean development methods helped remove delay from the “act” portion of the decision loop. DevOps helps remove even more delay in “act” and offers tons of new tools to help with “observe.” But we need to start the timer when the initial observations are made, not when the story lands in the backlog. Much time passes silently before a feature gets that far. The next great frontier is in the “deciding” phase.

 

In the sections that follow, we’ll look at some ways to change the structure of your organization to speed up the decision loop. We’ll also consider some ways to change processes to move from running one giant decision loop to running many of them in parallel. Finally, we’ll consider what happens when you push automation and efficiency too far.

Platform Team

In the olden days, a company kept its developers quarantined in one department. They were well isolated from the serious business of operations. Operations had the people who racked machines, wired networks, and ran the databases and operating systems. Developers worked on applications. Operations worked on the infrastructure.

The boundaries haven’t just blurred, they’ve been erased and redrawn. That began before we even heard the word “DevOps.” (See The Fallacy of the “DevOps Team”.) The rise of virtualization and cloud computing made infrastructure programmable. Open source ops tools made ops programmable, too. Virtual machine images and, later, containers and unikernels meant that programs became “operating systems.”

When we look at the layers from Chapter 7, Foundations, we see the need for software development up and down the stack. Likewise, we need operations up and down the stack.

What used to be just infrastructure and operations is now built from programmable components. It becomes the platform that everything else runs on. Whether you’re in the cloud or in your own data center, you need a platform team that views application development as its customer. That team should provide API and command-line provisioning for the common capabilities that applications need, as well as the things we looked at in Chapter 10, Control Plane:

  • Compute capacity, including high-RAM, high-IO, and high-GPU configurations for specialized purposes (The needs of machine learning and the needs of media servers are very different.)

  • Workload management, autoscaling, virtual machine placement, and overlay networking

  • Storage, including content addressable storage (for example, “blob stores”) and filesystem-structured storage

  • Log collection, indexing, and search

  • Metrics collection and visualization

  • Message queuing and transport

  • Traffic management and network security

  • Dynamic DNS registration and resolution

  • Email gateways

  • Access control, user, group, and role management

It’s a long list, and more will be added over time. Each of these is something that individual teams could build themselves, but none of them is valuable in isolation.

One important thing for the platform team to remember is that it is implementing mechanisms that allow others to do the real provisioning. In other words, the platform team should not implement all your specific monitoring rules. Instead, this team provides an API that lets you install your monitoring rules into the monitoring service provided by the platform. Likewise, the platform team doesn’t build all your API gateways. It builds the service that individual application teams use to build their own API gateways.
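
To make that concrete, here’s a minimal sketch of what such a self-service interface might look like, written in Python. Everything in it is hypothetical: the platform-api.internal host, the /v1/monitoring/rules endpoint, and the rule fields stand in for whatever your platform team actually exposes.

    # Hypothetical example: an application team installs its own alerting rule
    # through the platform's API instead of filing a ticket with operations.
    import requests

    PLATFORM_API = "https://platform-api.internal/v1"   # placeholder base URL

    def install_monitoring_rule(service, metric, threshold, notify):
        """Register an alerting rule with the platform's monitoring service."""
        rule = {
            "service": service,             # which application owns the rule
            "metric": metric,               # metric name as the app emits it
            "condition": f"> {threshold}",  # fire when the metric exceeds this
            "for": "5m",                    # sustained breach before alerting
            "notify": notify,               # team-owned notification channel
        }
        resp = requests.post(f"{PLATFORM_API}/monitoring/rules", json=rule, timeout=10)
        resp.raise_for_status()
        return resp.json()["rule_id"]

    rule_id = install_monitoring_rule(
        service="checkout",
        metric="http_5xx_rate",
        threshold=0.05,
        notify="#checkout-oncall",
    )

The payload details don’t matter. What matters is that the platform supplies the mechanism while the application team owns the policy.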

You might buy—or more likely download—a capital-P Platform from a vendor. That doesn’t replace the need for your own platform team, but it does give the team a massive head start.

The platform team must not be held accountable for application availability. That must be on the application teams. Instead, the platform team must be measured on the availability of the platform itself.

The platform team needs a customer-focused orientation. Its customers are the application developers. This is a radical change from the old dev/IT split. In that world, operations was the last line of defense, working as a check against development. Development was more of a suspect than a customer! The best rule of thumb is this: if your developers only use the platform because it’s mandatory, then the platform isn’t good enough.

Painless Releases

The release process described in Chapter 12, Case Study: Waiting for Godot, rivals that of NASA’s mission control. It starts in the afternoon and runs until the wee hours of the morning. In the early days, more than twenty people had active roles to play during the release. As you might imagine, any process involving that many people requires detailed planning and coordination. Because each release is arduous, the team doesn’t do many in a year. Because there are so few releases, each one tends to be unique. That uniqueness requires additional planning with each release, making the release a bit more painful and further discouraging frequent releases.

Releases should be about as big an event as getting a haircut (or compiling a new kernel, for you gray-ponytailed UNIX hackers who don’t require haircuts). The literature on agile methods, lean development, continuous delivery, and incremental funding makes a powerful case for frequent releases in terms of user delight and business value. With respect to production operations, however, there’s an added benefit of frequent releases. It forces you to get really good at doing releases and deployments.

A closed feedback loop is essential to improvement. The faster that feedback loop operates, the more accurate those improvements will be. This demands frequent releases. Frequent releases with incremental functionality also allow your company to outpace its competitors and set the agenda in the marketplace.

As commonly practiced, releases cost too much and introduce too much risk. The kind of manual effort and coordination I described previously is barely sustainable for three or four releases a year. It could never work for twenty a year. One solution—the easy but harmful one—is to slow down the release calendar. Like going to the dentist less frequently because it hurts, this response to the problem can only exacerbate the issue. The right response is to reduce the effort needed, remove people from the process, and make the whole thing more automated and standardized.

In Continuous Delivery [HF10], Jez Humble and Dave Farley describe a number of ways to deliver software continuously and at low risk. The patterns let us enforce quality even as we crank the release frequency up to 11. A “Canary Deploy” pushes the new code to just one instance, under scrutiny. If it looks good, then the code is cleared for release to the remaining machines. With a “Blue/Green Deploy,” machines are divided into two pools. One pool is active in production. The other pool gets the new deployment. That leaves time to test it out before exposing it to customers. Once the new pool looks good, you shift production traffic over to it. (Software-controlled load balancers help here.) For really large environments, the traffic might be too heavy for a small pool of machines to handle. In that case, deploying in waves lets you manage how fast you expose customers to the new code.
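
Here’s a rough sketch of how a blue/green cutover might be scripted. It’s an illustration, not a recipe: the LoadBalancer stand-in, the host names, and the /health endpoint are all invented, and a real environment would drive whatever API its load balancer actually offers.

    # Hypothetical blue/green cutover. The LoadBalancer class, the host names,
    # and the /health endpoint are stand-ins for whatever your platform provides.
    import time
    import requests

    GREEN_HOSTS = ["app-green-1.internal", "app-green-2.internal"]  # pool with the new build

    class LoadBalancer:
        """Stand-in for a software-controlled load balancer's API."""
        def set_active_pool(self, pool_name):
            print(f"production traffic now routed to the {pool_name} pool")

    def pool_is_healthy(hosts):
        """True only if every host in the pool passes its health check."""
        for host in hosts:
            try:
                resp = requests.get(f"https://{host}/health", timeout=5)
                if resp.status_code != 200:
                    return False
            except requests.RequestException:
                return False
        return True

    def cut_over(lb, soak_seconds=300):
        """Shift production traffic to the green pool once it proves healthy."""
        if not pool_is_healthy(GREEN_HOSTS):
            raise RuntimeError("green pool failed health checks; aborting cutover")
        time.sleep(soak_seconds)   # let the new build soak before it sees customers
        if not pool_is_healthy(GREEN_HOSTS):
            raise RuntimeError("green pool degraded during soak; aborting cutover")
        lb.set_active_pool("green")

    # cut_over(LoadBalancer())   # invoked by the deployment pipeline once the green pool is updated

A canary deploy looks much the same, except the new pool is a single instance and the traffic shift happens gradually instead of all at once.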

These patterns all have a couple of things in common. First, they all act as governors (see Governor) to limit the rate of dangerous actions. Second, they all limit the number of customers who might be exposed to a bug, either by restricting the time a bug might be visible or by restricting the number of people who can reach the new code. That helps reduce the impact and cost of anything that slipped past the unit tests.

Service Extinction

Evolution by natural selection is a brutal, messy process. It wastes resources profligately. It’s random, and changes fail more often than they succeed. The key ingredients are repeated iteration of small variations with selection pressure.

On the other hand, evolution does progress by incremental change. It produces organisms that are more and more fit for their environment over time. When the environment changes rapidly, some species disappear while others become more prevalent. So while any individual or species is vulnerable in the extreme, the ecosystem as a whole tends to persist.

We will look at evolutionary architecture in Evolutionary Architecture. It attempts to capture the adaptive power of incremental change within an organization. The idea is to make your organization antifragile by allowing independent change and variation in small grains. Small units—of technology and of business capability—can succeed or fail on their own.

Paradoxically, the key to making evolutionary architecture work is failure. You have to try different approaches to similar problems and kill the ones that are less successful.

Take a look at the figure that follows. Suppose you have two ideas about promotions that will encourage users to register. You’re trying to decide between cross-site tracking bugs to zero in on highly interested users and a blanket offer to everyone. A single big service that handles both will accumulate complexity faster than the sum of two smaller services. That’s because it must also make decisions about routing and precedence (at a minimum). Larger codebases are more likely to catch a case of “frameworkitis” and become overgeneralized. A vicious cycle comes into play: more code means it’s harder to change, so every piece of code needs to be more generalized, but that leads to more code. Also, a shared database means every change has a higher potential to disrupt. There’s little isolation of failure domains here.

images/adaptation/complected-promotions-service.png

Instead of building a single “promotions service” as before, you could build two services that can each chime in when a new user hits your front end. In the next figure, each service makes a decision based on whatever user information is available.

 

images/adaptation/simple-promotion-services.png

 

Each promotion service handles just one dimension. The user offers still need a database, but maybe the page-based offers just require a table of page types embedded in the code. After all, if you can deploy code changes in a matter of minutes, do you really need to invest in content management? Just call your source code repo the content management repository.
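
For illustration, that embedded table might be nothing more than a dictionary in the page-offer service’s code. The page types and the offer copy here are made up.

    # Hypothetical in-code "content management": a plain table mapping page types
    # to the offer shown there. The page types and copy are invented for illustration.
    PAGE_OFFERS = {
        "product_detail": "Register now and save 10% on this item.",
        "shopping_cart": "Create an account for free shipping on this order.",
        "order_history": None,   # no offer on this page type
    }

    def offer_for_page(page_type):
        """Return the offer for a page type, or None if there isn't one."""
        return PAGE_OFFERS.get(page_type)

Changing an offer is then just a commit and a quick redeploy.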

It’s important to note that this doesn’t eliminate complexity. Some irreducible, even essential, complexity remains. It does partition the complexity into different codebases, though. Each one should be easier to maintain and prune, just as it’s easier to prune a bonsai juniper than a hundred-foot oak. Here, instead of making a single call, the consumer has to decide which of the services to call. It may need to issue calls in parallel and decide which response to use (if any arrive at all). One can further subdivide the complexity by adding an application-aware router between the caller and the offer services.
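
A sketch of that consumer-side logic might look like the following. The service URLs, the response shape, and the priority field are all assumptions made for the sake of the example; the point is the parallel calls, the deadline, and the willingness to render the page with whatever arrives.

    # Hypothetical consumer-side logic: call both promotion services in parallel,
    # wait up to a short deadline, and render the page with whatever arrives.
    # The service URLs and the response shape are invented for this example.
    from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError as DeadlineExceeded
    import requests

    PROMO_SERVICES = [
        "https://user-offers.internal/offer",
        "https://page-offers.internal/offer",
    ]

    def fetch_offer(url, user_id, page_type):
        resp = requests.get(url, params={"user": user_id, "page": page_type}, timeout=0.25)
        resp.raise_for_status()
        return resp.json()   # e.g. {"offer": "...", "priority": 3}

    def gather_offers(user_id, page_type, deadline=0.3):
        """Collect whatever offers come back before the deadline; ignore failures."""
        pool = ThreadPoolExecutor(max_workers=len(PROMO_SERVICES))
        futures = [pool.submit(fetch_offer, url, user_id, page_type) for url in PROMO_SERVICES]
        offers = []
        try:
            for future in as_completed(futures, timeout=deadline):
                try:
                    offers.append(future.result())
                except Exception:
                    pass     # a failed or slow service just means no offer from it
        except DeadlineExceeded:
            pass             # deadline hit; go with whatever already arrived
        finally:
            pool.shutdown(wait=False)
        return offers

    def choose_offer(offers):
        """Pick the highest-priority offer, or None if nothing arrived in time."""
        return max(offers, key=lambda o: o.get("priority", 0), default=None)

The deadline matters more than the individual calls: the page renders on time whether or not every offer service answers.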

One service will probably outperform the other. (Though you need to define “outperform.” Is it based just on the conversion rate? Or is it based on customer acquisition cost versus lifetime profitability estimates?) What should you do with the laggard? There are only five choices you can make:

  1. Keep running both services, with all their attendant development and operational expenses.

  2. Take away funding from the successful one and use that money to make the unsuccessful one better.

  3. Retool the unsuccessful one to work in a different area where it isn’t competing head-to-head with the better one. Perhaps target a different user segment or a different part of the customer life cycle.

  4. Delete the unsuccessful one. Aim the developers at someplace where they can do something more valuable.

  5. Give up, shut down the whole company, and open a hot dog and doughnut shop in Fiji.

The typical corporate approach would be #1 or #2. Starve the successful projects because they’re “done” and double down on the efforts that are behind schedule or over budget. Not to mention that in a typical corporation, shutting down a system or service carries a kind of moral stigma. Choice #3 is a better approach. It preserves some value. It’s a pivot.

You need to give serious consideration to #4, though. The most important part of evolution is extinction. Shut off the service, delete the code, and reassign the team. That frees up capacity to work on higher value efforts. It reduces dependencies, which is vital to the long-term health of your organization. Kill services in small grains to preserve the larger entity.

As for Fiji, it’s a beautiful island with friendly people. Bring sunscreen and grow mangoes.

Team-Scale Autonomy

You’re probably familiar with the concept of the two-pizza team. This is Amazon founder and CEO Jeff Bezos’s rule that every team should be sized no bigger than you can feed with two large pizzas. It’s an important but misunderstood concept. It’s not just about having fewer people on a team. That does have its own benefit for communication.

A self-sufficient two-pizza team also means each team member has to cover more than one discipline. You can’t have a two-pizza team if you need a dedicated DBA, a front-end developer, an infrastructure guru, a back-end developer, a machine-learning expert, a product manager, a GUI designer, and so on.

The two-pizza team is about reducing external dependencies. Every dependency is like one of the Lilliputians’ ropes tying Gulliver to the beach. Each dependency thread may be simple to deal with on its own, but a thousand of them will keep you from breaking free.

Dependencies across teams also create timing and queuing problems. Anytime you have to wait for others to do their work before you can do your work, everyone gets slowed down. If you need a DBA from the enterprise data architecture team to make a schema change before you can write the code, it means you have to wait until that DBA is done with other tasks and is available to work on yours. How high you are on the priority list determines when the DBA will get to your task.

The same goes for downstream review and approval processes. Architecture review boards, release management reviews, change control committees, and the People’s Committee for Proper Naming Conventions...each review process adds more and more time.

This is why the concept of the two-pizza team is misunderstood. It’s not just about having a handful of coders on a project. It’s really about having a small group that can be self-sufficient and push things all the way through to production.

Getting down to this team size requires a lot of tooling and infrastructure support. Specialized hardware like firewalls, load balancers, and SANs must have APIs wrapped around them so each team can manage its own configuration without wreaking havoc on everyone else. The platform team I discussed in Platform Team has a big part to play in all this. Its objective must be to enable and facilitate this team-scale autonomy.

Beware Efficiency

“Efficiency” sounds like it could only ever be a good thing, right? Just try telling your CEO that the company is too efficient and needs to introduce some inefficiency! But efficiency can go wrong in two crucial ways that hurt your adaptability.

Efficiency sometimes translates to “fully utilized.” In other words, your company is “efficient” if every developer develops and every designer designs close to 100 percent of the time. This looks good when you watch the people. But if you watch how the work moves through the system, you’ll see that this is anything but efficient. We’ve seen this lesson time and time again from The Goal [Gol04], to Lean Software Development [PP03], to Principles of Product Development Flow [Rei09], to Lean Enterprise [HMO14] and The DevOps Handbook [KDWH16]: Keep the people busy all the time and your overall pace slows to a crawl.
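
The queueing math behind that claim is worth a quick look. Under the simplest single-queue assumptions (an M/M/1 model), the time a piece of work spends waiting grows in proportion to utilization divided by one minus utilization, so the last few percentage points of “efficiency” are wildly expensive:

    # Why full utilization kills flow: in a simple M/M/1 queue, the expected time
    # a piece of work spends waiting scales with u / (1 - u), where u is utilization.
    def relative_wait(utilization):
        """Queue wait time, expressed as a multiple of the processing time itself."""
        return utilization / (1.0 - utilization)

    for u in (0.50, 0.80, 0.90, 0.95, 0.99):
        print(f"utilization {u:.0%}: work waits about {relative_wait(u):.0f}x its processing time")

At 50 percent utilization, a work item waits about as long as its own processing time; at 99 percent, it waits roughly a hundred times that. Keeping the workers busy is not the same as keeping the work moving.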

A more enlightened view of efficiency looks at the process from the point of view of the work instead of the workers. An efficient value stream has a short cycle time and high throughput. This kind of efficiency is better for the bottom line than high utilization. But there’s a subtle trap here: as you make a value stream more efficient, you also make it more specialized to today’s tasks. That can make it harder to change for the future.

We can learn from a car manufacturer that improved its cycle time on the production line by building a rig that holds the car from the inside. The new rig turned, lifted, and positioned the car as it moved along the production line, completely replacing the old conveyor belt. It meant that the worker (or robot) could work faster because the work was always positioned right in front of them. Workers didn’t need to climb into the trunk to place a bolt from the inside. It reduced cycle time and had a side effect of reducing the space needed for assembly. All good, right? The bad news was that each vehicle model needed its own custom rig, so it became more difficult to redesign the vehicle or to switch from cars to vans or trucks. Efficiency came at the cost of flexibility.

This is a fairly general phenomenon: a two-person sailboat is slow and labor-intensive, but you can stop at any sandbar that strikes your fancy. A container ship carries a lot more stuff, but it can only dock at deep-water terminals. The container ship trades flexibility for efficiency.

Does this happen in the software industry? Absolutely. Ask anyone who relies on running builds with Visual Studio out of Team Foundation Server how easily they can move to Jenkins and Git. For that matter, just try to port your build pipeline from one company to another. All the hidden connections that make it efficient also make it harder to adapt.

Keep these pitfalls in mind any time you build automation and tie into your infrastructure or platform. Shell scripts are crude, but they work everywhere. (Even on that Windows server, now that the “Windows Subsystem for Linux” is out of beta!) Bash scripts are that two-person sailboat. You can go anywhere, just not very quickly. A fully automated build pipeline that delivers containers straight into Kubernetes every time you make a commit and that shows commit tags on the monitoring dashboard will let you move a lot faster, but at the cost of making some serious commitments.

Before you make big commitments, use the grapevine in your company to find out what might be coming down the road. For example, in 2017 many companies are starting to feel uneasy about their level of dependency on Amazon Web Services. They are edging toward multiple clouds or just straight-out migrating to a different vendor. If your company is one of them, you’d really like to know about it before you bolt your new platform onto AWS.

Summary

Adaptability doesn’t happen by accident. If there’s a natural order to software, it’s the Big Ball of Mud.[86] Without close attention, dependencies proliferate and coupling draws disparate systems into one brittle whole.

Let’s now turn from the human side of adaptation to the structure of the software itself.
