Chapter 15. Using Less: Usage Optimization

In this chapter we discuss one of the toughest parts of the FinOps lifecycle. Usage reduction can take lots of effort and time to implement culturally. If cost-saving quick wins are your goal, you will find that purchasing commitment-based discounts—which we cover in Chapter 17—is a faster path to cost reductions. Unlike buying reservations, which can bring dramatic cost savings without any engineering action, usage reduction requires engineers to insert optimization work into their sprints. This isn’t to say there’s no low-hanging fruit—just that it will take more effort to get it done.

Cloud service providers sell you on the idea that you pay for what you use. However, a more accurate statement is that you pay for the resources you provision, whether you use them or not. In this chapter, we’re going to look at the resources deployed into your cloud accounts that are either not being used or not being used efficiently.

The Cold Reality of Cloud Consumption

When you start shining a light on usage optimization, don’t be surprised if you hear a version of this conversation:

FinOps practitioner: “Hey, it looks like you’re not using these resources over here. Do you need them?”

Team member: “Oh, those are still there? I forgot about them! I’ll delete them now.”

At this point, you may, understandably, slap a frustrated hand to your forehead. You’ll quickly realize that if you had simply had this conversation sooner, you might have saved months of paying for that resource.

But it’s not that simple. When you’re operating at scale in the cloud, you can almost guarantee that there’ll be some underutilized or forgotten resources. When faced with the seemingly limitless resource choices in cloud, many engineers will tend to go for the larger resource option. Making that choice can keep them from getting paged in the middle of the night.

However, it can be difficult to know which resources need reviewing. As we’ve mentioned, teams need the right metrics to understand their application’s requirements. Teams must also understand any business reasons for why resources are sized and operated the way they are. FinOps is effective because it requires that all groups across an organization follow a consistent workflow and work together to put recommendations into action.

However, when you do engage usage reduction workflows, you can realize savings where you were previously wasting money on oversized resources, and you can prevent waste from building up in your cloud usage going forward. We’ll look at what constitutes waste, ways to detect the resources needing investigation, methods to reduce usage, how to get your teams to implement changes, and the correct means to track your progress.

Usage reduction at scale requires a cultural shift. It’s not a one-time fix but a continuous cycle of picking properly sized resources and eliminating overprovisioning. Usage reduction requires engineers to think of cost like they think of memory or bandwidth: as another deployment KPI. And they’ll see very soon how thinking about cost as an efficiency metric is worth the effort.

Halving the size of a resource will generally save 50% of the cost—or in the case of an unused resource such as an unattached block storage volume or idle instance, 100% of the cost. One of the benefits of the variable, on-demand nature of the cloud is that you are free to resize any resource—or to stop using it entirely—whenever you desire. Further savings are possible on correctly sized resources that you use consistently with rate reduction options, which we cover in Chapter 17.

Initially, usage reduction tends to be retroactive. Teams go through and clean up their infrastructure to ensure a fully utilized set of resources. In more mature practices, engineers proactively consider cost in architecture reviews before deployment and when initially choosing cloud resources. Later phases bring the rearchitecting of applications to use more cloud native offerings.

The goal isn’t to make the perfect decision. It’s to make the best decision possible given the context and data, and then reevaluate often. A well-sized resource won’t stay that way forever, as code and load evolve over time.

Where Does Waste Come From?

What we call waste is any paid usage or portion of paid usage that could have been avoided by resizing or reducing the number of your resources to better fit the needs of your application.

When cloud resources are first deployed, often their capacity requirements aren’t fully known or well understood. That’s true for almost everyone in those early days of deployment.

This uncertainty leads teams to overprovision capacity to ensure there won’t be any performance issues. Overprovisioning resources when initially deploying isn’t a bad thing. The beauty of cloud is that it gives you the ability to deploy quickly and then make adjustments along the way. What you want to avoid is deploying resources and not monitoring them for overprovisioning. That’s when you find yourself paying more than you should for resources over a long period.

Wastage grows as teams neglect to maintain their resources based on utilization metrics. It is essential to continuously monitor your resources for oversizing. That’s because a service that’s using its resources correctly today may become overallocated tomorrow after the deployment of more efficient service code. Even changes in customer behavior can result in reduced capacity requirements.

Usage Reduction by Removing/Moving

Research shows that there are two types of cloud teams: those that forget about some of their resources, and those that fib about forgetting about some of their resources. In other words, it is very common for resources to be forgotten about and left in cloud accounts. Don’t assume bad intent or carelessness when you find the same in your company. The job of engineering has many competing priorities, and automated tooling to detect waste is not always prioritized by leadership. In other words, it happens to the best of us.

Tip

Find waste by looking for areas with very stable costs and dated resource types. “Old and unchanging” is not how the cloud should operate. If it’s been sitting there untouched for a long time, it likely can be optimized.

Simply telling teams not to forget about resources doesn’t cover all the bases. It’s possible a resource was created by automation that failed to remove it during teardown. Or the resource was intentionally left there by the team until an unspecified later date, but no follow-up action was taken as priorities shifted. Additionally, some resources can be configured to remain when their parent resource gets deleted, like volumes attached to a server, and when deleting some resources, the cloud service provider can automatically create a snapshot of the data that sticks around whether you need it or not.

A lost or forgotten resource is the easiest type for teams to handle. Usage reduction in such cases can be as simple as deleting the resource, saving 100% of the cost of that resource.

Another conversation we often hear is around the need to keep data stored inside a resource. Usually, teams have compliance reasons to retain data for very long periods.

Even if you need to retain an unused volume for this purpose, there are ways to reduce what you are paying. For example, cloud service providers offer multiple storage tiers, and paying for data in higher-price storage offerings makes no sense when you can move it to a lower-price “cold storage” resource like Azure Archive or AWS Glacier storage.

Consider creating a data retention policy that makes it clear to teams what data they need to keep and for how long. Once a retention policy is in place, teams can automatically move data to different storage offerings and delete it when appropriate.
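As a rough sketch of what that automation can look like on AWS (the bucket name, prefix, and retention periods here are hypothetical, not a recommendation), an S3 lifecycle configuration can transition data to an archive tier and later delete it:

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-logs-bucket",       # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire",
                "Filter": {"Prefix": "logs/"},   # hypothetical prefix
                "Status": "Enabled",
                # Move to cold storage after 90 days...
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                # ...and delete once the (hypothetical) 7-year retention ends.
                "Expiration": {"Days": 2555},
            }
        ]
    },
)
```

Azure Blob Storage and Google Cloud Storage offer equivalent lifecycle management policies, so the same retention rules can be encoded on each provider.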

Usage Reduction by Resizing (Rightsizing)

To resize a resource, you need visibility into how much of the resource is utilized. That visibility must include CPU usage, memory utilization, network throughput, and disk utilization.

When rightsizing, you aren’t just optimizing for cost savings. You also need to make sure that you don’t impact the services that teams are operating. The primary goal of the teams when managing services is to ensure that the services themselves do not run out of required operating capacity. As a result, it’s not uncommon for them to resist the idea of resizing resources, especially those supporting production applications.

If the teams have to stop working on what they are doing to investigate resizing their resources, there can be real impacts to delivery schedules. You should consider these impacts when deciding whether teams should focus on resizing resources to reduce costs. Often there are a handful of high-cost resources and a long tail of small resources that would save very little. We recommend a cutoff point (a minimum savings amount) under which rightsizing recommendations can be ignored. This threshold should be reviewed often to be sure it remains sensible. The goal is to make sure that time spent on rightsizing results in material savings for the business.

To prevent concerns about impact to resources or automation while implementing resizing changes, it’s important to understand the role the FinOps team plays in rightsizing. FinOps is about collaboration between teams and should not be performed independently by the FinOps practitioner. This is where having a conversation is crucial, because it’s usually impossible to infer the method teams used in determining the size of their resources just by looking at metrics alone.

The central FinOps team is there to operate the automated analysis of resource usage, provide reports on which resources appear to be underutilized, and programmatically provide alternative, more efficient configurations. Teams are then given the recommendations and investigate the resources to identify any reasons that would prevent resizing them to realize the potential cost savings. It’s imperative that engineering organizations understand that the FinOps team is not intending to introduce performance risk, but focusing on safely reducing capacity where it offers no business benefit.

It’s essential to supply multiple recommendations that would fit the existing workload on the resource without affecting workload performance. Multiple recommendations give teams the opportunity to balance the risk involved in reducing the resource size against the potential savings. Native, open source, or third-party FinOps tooling can do the heavy lifting of providing these recommendations, and engineering teams can investigate further using application performance monitoring tools.

Tip

Rick Ochs, Principal Product Manager at AWS, shares that it is important to ensure that the rightsizing recommendations are high quality, helping engineering teams avoid hours of validation work to uncover potential risks, such as account quotas, attachment limits, or CPU speed differences. Offering rightsizing recommendations that are directionally accurate but simplistic can often lead to engineering teams spending more time on rejecting recommendations than the benefit of rightsizing can provide.

By using high-quality rightsizing recommendations—those that go beyond CPU averages to include I/O, throughput, and memory, and that don’t rely solely on peak or average metrics—you can lower the amount of effort required and increase the number of recommendations that turn into material cost savings.

Every rightsizing recommendation is an opportunity to have a discussion with someone. You have to fully understand the overall effort to rightsize existing resources and ensure that new apps are rightsized from the beginning.

Benjamin’s story especially makes sense when you consider that during the time we built things in the data center, we intentionally made infrastructure oversized to allow the workloads on them to grow over the five to seven years they’d be in service. It was considered a standard best practice to oversize your physical hardware purchases to avoid the pain of having to repurchase hardware if your applications caused the system to run out of capacity only halfway through the hardware’s life span. The common result was very low, single-digit average hardware utilization in physical data centers. If we lift and shift by matching exactly the CPU and memory that was in the data center, we are carrying over that oversizing without even knowing it. An alternative approach, if you can’t look at utilization right away, is to go a little smaller in the cloud by default and address performance issues as they come up. This approach can improve lift-and-shift cost efficiency when engineering can easily spot performance issues but can’t yet see the cost.

Common Rightsizing Mistakes

Rightsizing is rarely as straightforward as it seems when first considered. There are a plethora of details to consider, and many common mistakes are repeated by those new to cloud optimization.

Relying on Recommendations That Use Only Averages or Peaks

When looking at the existing workload on a resource (like a server instance), it’s crucial to ensure that you’re not just looking at average and peak values of the metrics. If you rely on averages or peaks alone, you can be misled.

It’s not a simple matter to take into account shifts and spikes in utilization and then apply a statistical model to enumerate the best-matched resource size. Most tools that we’ve seen use some kind of rule of thumb based only on utilization averages over a time series, and then recommend the tightest-fitting resource type. Other tools will recommend upsizes for any peak they see, no matter the reason.

For example, when you compare the two usage graphs in Figure 15-1, both have an average CPU statistic of 20% utilized over the hour. However, using the average CPU usage metric, you can’t tell whether the CPU was at 20% the whole time (as shown in the right graph) or up near max CPU for one-fifth of the time and near minimum for the remaining four-fifths (as shown in the left graph). Resizing these instances to a server instance with half the CPU capacity would impact the performance of the left instance during its peak CPU usage. It doesn’t matter if you look per day, per hour, or even per minute. Averages don’t tell the full story of what is occurring on the instance.

Figure 15-1. Graphs showing two CPU workloads: one has a short 90%+ peak and then remains at <10% for the rest of the hour, while the other is consistently at 20%

Using only a utilization average can lead to two outcomes. The most common scenario is that the engineer realizes the recommendation is completely bogus, and the cost-saving initiative ends. The team realizes that they’ve wasted a bunch of time, and their deployment is still costing more than it should.

The other scenario is far worse. A less experienced engineer may go ahead and take the recommended action. During quiet times, or even average usage times, the new resource configuration will likely hold up fine. But when a regular spike or shift occurs, things begin to unravel. Demands on the resource push it beyond what it’s capable of. Performance starts to degrade, and it’s even possible to see some sort of outage. Any potential cost saving becomes vastly outweighed by the risk to the business.

Sizing all of your workloads to max—or peak—can also be a mistake. In a typical month of enterprise VM usage, various activities will occur on these workloads: patching, maintenance, deployments, backups, and the like. While any of these activities can spike CPU usage, that does not mean you should immediately upsize, or trust a recommendation that always sizes your instance to cover a peak usage spike. Even normal OS reboots will drive CPU utilization to 100%. That would be akin to throwing out your laptop and buying a faster one every time you boot up, because it hits 100% CPU utilization. Instead, prefer a sizing method that excludes a small number of peaks, so that transient spikes due to patching, maintenance, or other behaviors do not drive upsizes; only utilization sustained for more than a reasonable 1% or 5% of the time over a week should. Ideally, use rightsizing recommendations that account for this by using percentile calculations.

Imagine if you were using a rightsizing product that only keyed on peak usage, and then a major security vulnerability was discovered that required a large patch to be installed across your entire company’s fleet of instances. Your rightsizing recommendations would erroneously tell you to upsize everything.

Rick Ochs, Principal Product Manager at AWS
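To make this point concrete, here is a small illustrative sketch (the workload data is simulated, not taken from a real fleet) showing how the average and the peak both mislead, while percentiles capture sustained load:

```python
import numpy as np

rng = np.random.default_rng(0)
# One simulated week of per-minute CPU samples: mostly ~15% utilization,
# with brief patching/backup-style spikes to 95% for ~0.5% of samples.
cpu = rng.normal(15, 3, 10_080)
cpu[rng.choice(10_080, size=50, replace=False)] = 95.0

print(f"average: {cpu.mean():.1f}%")           # misleadingly low
print(f"peak:    {cpu.max():.1f}%")            # misleadingly high
print(f"p95:     {np.percentile(cpu, 95):.1f}%")
print(f"p99:     {np.percentile(cpu, 99):.1f}%")
# Sizing to p95/p99 ignores the transient spikes but still
# respects any load that is sustained for a meaningful share of the week.
```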

Failing to Rightsize Beyond Compute

Everyone loves to first focus energy on compute savings, but rightsizing needs to extend beyond compute so that you solve for the bigger cloud cost picture. You see utilization issues across the board, but two particular spots worth mentioning are databases such as AWS RDS (Relational Database Service), Azure SQL, and Google Cloud SQL, and block storage such as AWS EBS (Elastic Block Store), Azure managed disks, and Google Cloud Persistent Disk. If you aren’t looking at noncompute services such as databases and storage, you’re leaving a whole bunch of potential savings on the table.

Not Addressing Your Resource “Shape”

Rightsizing decisions shouldn’t be stuck within one compute family. Each cloud provider has a number of specialized compute families, and you should choose one that matches the resourcing shape of your workload. Within AWS and Azure, these families are offered in fixed shapes with a certain number of CPU cores, a certain amount of memory, and potentially other features like fast networking. For example, your workload may require four cores and 3 GB of memory, and might have been moved to the cloud running on an r6i.xlarge in AWS. This has the right number of cores, but dramatically too much memory. Moving instead to a c6i.xlarge significantly reduces the cost and gets your shape correct by keeping compute consistent while reducing your memory. The way Google Cloud compute is purchased allows you to select the CPU and memory mix instead of picking from a large list of predefined instance configurations. You may also find that certain instance families have faster CPUs, which can sometimes allow you to lower the core count, or, alternatively, slower CPUs, which are offered at a lower price. By being thoughtful about the instance families you can pick from, you can achieve additional savings beyond simply picking the correct amount of CPU and memory.
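A minimal sketch of the shape-matching idea follows. The vCPU and memory figures are the real shapes of these instance types, but the prices are illustrative placeholders, not quoted rates:

```python
# Hypothetical catalog: (vCPUs, memory in GiB, illustrative $/hr).
FAMILIES = {
    "c6i.xlarge": (4, 8, 0.17),   # compute-optimized
    "m6i.xlarge": (4, 16, 0.19),  # general purpose
    "r6i.xlarge": (4, 32, 0.25),  # memory-optimized
}

def cheapest_fit(cores_needed: int, mem_gib_needed: float) -> str:
    """Return the cheapest shape that satisfies both CPU and memory needs."""
    candidates = [
        (price, name)
        for name, (cores, mem, price) in FAMILIES.items()
        if cores >= cores_needed and mem >= mem_gib_needed
    ]
    return min(candidates)[1]

print(cheapest_fit(4, 3))  # c6i.xlarge: right core count, no wasted memory
```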

Not Simulating Performance Before Rightsizing

Your teams might be concerned with clipping (impacting performance when reducing a server’s resources), and that can drive suboptimal decisions. But with the means to get a forecast of how a recommendation affects their infrastructure, they can make better choices to optimize their cloud. Before making a rightsizing move, you must be sure to visualize the impact of each option and consider multiple recommendations across compute families. That way, you can assess the probability that there could be clipping (see Figure 15-2) and take that risk into account when compared to the savings you’ll realize. Without this step, you risk being too conservative (limited savings) or too aggressive (performance hit/outage).

Figure 15-2. Resizing based on averages will cause clipping

Hesitating Due to Reserved Instance Uncertainty

A common refrain from teams is, “I don’t want to rightsize because it might affect my commitment coverage and cause waste.” There was a time when this was a valid worry and caution was required. But not anymore. All three cloud providers now have a number of options that give you far more flexibility and allow you to rightsize with confidence. When using reservations that have flexibility, you’re able to adjust your reservations if they’re impacted by your usage reduction efforts. We’ll cover these flexibility concepts in Chapters 16 and 17.

Going Beyond Compute: Tips to Control Cloud Costs

While compute resources make up a supermajority of spending on many cloud bills, many other commonly overlooked services can also be optimized. Here we’ll look at some examples of other cloud service focus areas.

Block Storage

Closely related to cloud compute instances are the block storage devices attached to them. Looking at the big three cloud service providers, block storage services include AWS’s Elastic Block Store (EBS), Azure managed disk, and Google Cloud’s Persistent Disk.

Get rid of orphaned volumes

The major feature of block storage is that the volume persists after the compute instances stop. This is good for data retention, but bad if you’re paying for storage that’s no longer needed. Unused volumes are called unattached or orphaned volumes, and they’re marked as available if you check their status. These volumes can’t take any traffic without an attached compute instance, so they’re useless as is. A great first step to saving money is to get rid of them.

One option is just to terminate an unattached volume, but you should do so only if you’re confident that the data is never going to be needed. To be sure, check when the volume was last attached. If it was months ago, there’s a good chance you no longer need the volume. This is especially true for nonproduction environments. Some cloud providers allow you to leave the block storage available when a spot compute instance is removed, so before taking action, be sure you don’t have any data retention needs for this data.

Tip

There’s a great story about how the deletion of a nine-year-old unattached volume created a production issue on Episode 13 of the FinOpsPod podcast entitled “FinOops: Lessons Learned the Hard Way.” Be sure to check it out on Spotify or Apple Podcasts.

A more cautious approach is to first take a snapshot of the volume and then terminate it. Snapshots are always cheaper than the original volume. They discard blank space, compress the data, and are stored in cheaper storage tiers like AWS’s S3 or Google Cloud’s Cloud Storage. If you do need to restore the volume, you can do it from the snapshot.
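Here’s one way this snapshot-then-delete cleanup might look with boto3. This is only a sketch: it assumes every available volume is safe to archive, which you should confirm with the owning team first.

```python
import boto3

ec2 = boto3.client("ec2")

# Find volumes in the "available" state (i.e., attached to nothing).
orphans = ec2.describe_volumes(
    Filters=[{"Name": "status", "Values": ["available"]}]
)["Volumes"]

for vol in orphans:
    # Cautious approach: snapshot first, then terminate the volume.
    snap = ec2.create_snapshot(
        VolumeId=vol["VolumeId"],
        Description=f"Pre-deletion archive of {vol['VolumeId']}",
    )
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])
    ec2.delete_volume(VolumeId=vol["VolumeId"])
```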

Focus on zero throughput or zero IOPS

Once you’ve gotten rid of unattached volumes, you look for attached volumes that are doing nothing. These often show up when the associated instances have been turned off and the volumes were forgotten. To discover these volumes, you look at the volume network throughput and IOPS (input/output operations per second). If there haven’t been any throughput or disk operations in the last 10 days, the volume probably isn’t being used.
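A sketch of that detection using CloudWatch metrics is shown below; the 10-day window matches the heuristic above, and pagination is omitted for brevity:

```python
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client("ec2")
cw = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

for vol in ec2.describe_volumes(
    Filters=[{"Name": "status", "Values": ["in-use"]}]
)["Volumes"]:
    ops = 0.0
    for metric in ("VolumeReadOps", "VolumeWriteOps"):
        stats = cw.get_metric_statistics(
            Namespace="AWS/EBS",
            MetricName=metric,
            Dimensions=[{"Name": "VolumeId", "Value": vol["VolumeId"]}],
            StartTime=now - timedelta(days=10),
            EndTime=now,
            Period=86_400,          # one datapoint per day
            Statistics=["Sum"],
        )
        ops += sum(dp["Sum"] for dp in stats["Datapoints"])
    if ops == 0:
        print(f"{vol['VolumeId']}: no I/O in 10 days -- candidate for review")
```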

Make managing your block storage costs a priority

With block storage volumes, you pay for two key attributes: storage and performance. Storage is charged per gigabyte stored, with a rate based on the location and volume type. For performance, the better it is, the more expensive it is, whether in terms of IOPS or throughput. Amazingly, volumes are often ignored when optimizing cloud expenditure.

Reduce the number of higher IOPS volumes

Volumes with higher IOPS guarantees (e.g., AWS provisioned IOPS volumes or Azure premium disk) aren’t cheap, and it’s relatively easy to change them. Using historic metrics to locate underutilized disks and having engineers change them where possible can greatly reduce disk costs.

Take advantage of elastic volumes

When you use AWS EBS, a volume can increase in size, adjust performance, or change volume type while the volume is in use. This can be done while an application is actively driving I/O to your volume. There’s no need for a maintenance window or downtime. There’s a big cost savings benefit, because you no longer have to overprovision your storage. Often engineering organizations will be far more open to taking recommendations that require no downtime, as they don’t have to deal with change tickets, downtime windows, or the risk of having to take additional downtime if they want to revert the size change.
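For example, with boto3 the change is a single in-place call (the volume ID and performance targets here are hypothetical):

```python
import boto3

ec2 = boto3.client("ec2")
# Adjust type and provisioned performance in place: no detach, no downtime.
ec2.modify_volume(
    VolumeId="vol-0123456789abcdef0",  # hypothetical volume ID
    VolumeType="gp3",
    Iops=3000,
    Throughput=125,
)
```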

Object Storage

Often referred to as unlimited storage, object storage services are highly likely to be the location with the majority of your data. Object storage services include AWS’s S3, Google Cloud’s Cloud Storage buckets, and Azure’s Blob Storage.

Implement data retention policies

The ease of storing data in object storage services and the fact that these data stores are theoretically limitless make it easy for engineers to store data forever. In the data center, available disk space was a constraint, and data retention policies had to be implemented to avoid unbounded storage growth. Moving into the cloud removes the challenge of making storage capacity available, but it should not remove the need for data retention policies.

Choose a storage tier/class that matches the data

By default, data sent to object storage services is often stored in the standard/hot storage class. Standard storage is usually the most expensive storage class but has the most availability and durability. Classifying your data and selecting the right storage class for the data type can save you significant amounts on object storage. Some cloud service providers provide automated classification and data class selection, such as AWS’s Intelligent-Tiering.
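For instance, on AWS you can opt an object into Intelligent-Tiering at write time, so AWS moves it between access tiers automatically based on observed access patterns (the bucket and key names are hypothetical):

```python
import boto3

s3 = boto3.client("s3")
with open("annual-report.parquet", "rb") as f:
    s3.put_object(
        Bucket="example-reports-bucket",   # hypothetical bucket
        Key="2024/annual-report.parquet",  # hypothetical key
        Body=f,
        StorageClass="INTELLIGENT_TIERING",
    )
```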

Networking

Networking costs can often be overlooked, possibly due to the fact that network traffic is not a named resource like a cloud compute instance or an object storage bucket. There are a few ways to optimize networking costs, but many will need strong collaboration with the team that manages the network within your organization.

Clean up unused IP addresses

It’s possible to be assigned fixed IP addresses by your cloud service provider; when these addresses aren’t associated with a running compute instance, they often incur an hourly charge. Unless there are business reasons to keep unused IP addresses, removing them can reduce some networking spend.
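A minimal boto3 sketch of that cleanup on AWS follows. It assumes any unassociated Elastic IP is safe to release, so check for business reasons first, as noted above:

```python
import boto3

ec2 = boto3.client("ec2")
for addr in ec2.describe_addresses()["Addresses"]:
    # An Elastic IP with no association is idle and still billed hourly.
    if "AssociationId" not in addr and addr.get("AllocationId"):
        print(f"releasing unused Elastic IP {addr['PublicIp']}")
        ec2.release_address(AllocationId=addr["AllocationId"])
```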

Optimize network routes

Some cloud service providers offer networking constructs (like AWS VPC endpoints) that enable access to their service APIs without using a public IP address or routing traffic via network address translation (NAT) services. Using these network constructs can reduce the cost of transferring data between your applications and the cloud services in use.
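As an example on AWS, a gateway endpoint for S3 keeps that traffic on the AWS network instead of flowing through NAT services that bill per gigabyte processed (the VPC and route table IDs are hypothetical):

```python
import boto3

ec2 = boto3.client("ec2")
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",             # hypothetical VPC ID
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],   # hypothetical route table
)
```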

Usage Reduction by Redesigning

The most complex method of usage reduction is to redesign the services themselves. Having engineering teams change the way software is deployed, rewrite applications, or even change the software altogether can help you take advantage of cloud native offerings.

Scaling

Changing the size of a resource isn’t always the whole answer—the resources might be utilized perfectly well during business hours but not outside of them. Of course, if you resize the instance to be smaller, during business hours it will be overutilized and probably cause performance issues.

Teams might be able to architect production services to be more cloud native, allowing them to take advantage of the elasticity of the cloud. The most common approach is to have services scale out horizontally (i.e., provision more resources) during busy periods and then scale back in off-peak hours. The service itself needs to support this dynamic scaling, which may require redesigning the application code. Or scaling may not be possible at all, especially with proprietary software.
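On AWS, for example, a target-tracking policy on an Auto Scaling group expresses this directly: add instances when average CPU exceeds the target and remove them as load drops, instead of running peak capacity around the clock. The group name and target value below are hypothetical:

```python
import boto3

autoscaling = boto3.client("autoscaling")
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-tier-asg",   # hypothetical ASG name
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,   # scale out above ~50% average CPU
    },
)
```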

Modern methods of building service-based, loosely coupled, modular, and stateless applications in place of monoliths are gaining traction due to their support in cloud and container environments, so the ability to scale will appear in more applications going forward.

Scheduled Operations

Teams often leave development and testing environments running while they sleep. Consider a geographically centralized development team that uses its resources for 40–50 hours a week and not at all for the remaining 118+ hours. If they’re able to turn off development resources for around 70% of the week, they can create massive savings for their organization. Even with a distributed team, depending on where its members are located, there’s almost always a 24-hour window each week when everyone is on their weekend. Incentivizing your engineering teams to turn off resources during off-hours by allowing them to use the savings on new projects or engineering efforts can help drive cultural accountability for cloud spend.

Providing teams with automated ways to turn off resources when they’re not needed is the best method of ensuring that resources are stopped out of hours. We’ll cover automation later, in our discussion of the operate phase of the FinOps lifecycle.
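As a preview of that automation, a small scheduled function might stop tagged nonproduction instances each evening. The tag scheme here is an assumption; your organization’s tagging standard will differ:

```python
import boto3

ec2 = boto3.client("ec2")

def stop_dev_instances(event=None, context=None):
    """Lambda-style handler, intended to run on an evening cron schedule."""
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:environment", "Values": ["dev", "test"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if ids:
        ec2.stop_instances(InstanceIds=ids)
```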

Effects on Reserved Instances

We’re often asked, “What about all of my reservations?” If there are committed reservations such as Savings Plans (SPs), Committed Use Discounts (CUDs), or Reserved Instances (RIs), there’s always a concern about changes to your usage causing your reservations to become underutilized. Generally, you avoid this problem by performing usage optimizations before committing to rate optimizations like RIs or CUDs.

It can take some time to implement usage reductions, usually much longer than was initially expected. Almost every day we hear someone say, “I’ll make commitments after I clean up my infrastructure.” We’ve found that 9 times out of 10, cleaning up takes longer than expected—usually months—and during that time they end up paying for oversized resources at the higher on-demand rate.

Rightsizing is a very intimate thing and a long process. Engineers sweat over their infrastructure. It’s not as easy to say, “Well, it should be on a bigger or smaller instance.” There are test cycles and real effort required to make those changes. So, we take the approach that if we can centralize RIs and commitments, make good investments, and target low RI vacancy [waste] numbers, then the rightsizing will catch up eventually.

Jason Fuller, HERE Technologies

Priorities tend to change, and fires periodically must be extinguished. You should accept these as facts of life and look to get some immediate coverage of commitment-based discounts, regardless of how much waste you think you have. The strategy we often recommend is to start small with something like 20%–25% coverage, and then to slowly grow it in tandem with cleanup efforts. There will be more on creating an effective commitment strategy in Chapter 18.

Unless you’ve committed to every bit of your cloud usage, usage optimizations should still be possible. Going forward, you should take into account the amount of usage optimizations available to your organization before committing to rate optimizations. Most cloud service providers allow some ability to exchange or change commitments to match new usage parameters. With some planning, you can avoid being locked into the incorrect commitments and start your usage reduction efforts by modifying rate optimizations as you make changes. By rightsizing your instances across different families, you can often increase your coverage and reduce your instance spend at the same time. You can effectively “Tetris” your workloads in a more optimal way by fitting more instances into the RIs and discounts you’ve already purchased. It is more cost-effective to rightsize and discount your workloads, thereby potentially leaving some unused discount capacity, than it is to ignore rightsizing for fear of risking wasted discounts.

Benefit Versus Effort

When you’re looking at usage reduction recommendations, it’s essential to consider the possible savings against the engineering effort and risks to your production services. If the amount of time needed to investigate and implement the changes is more than the savings, it might be best to ignore these recommendations. Ideally, you filter out low-value and/or high-risk recommendations.

One of the FinOps Foundation members has their team look at cloud spend in terms of engineering hours. They translate the expected savings into an equivalent number of engineering hours. If they can save 1,000 engineering hours’ worth of spend by investing only 3 engineering hours to capture those savings, they’ll do the work. If they’ll save only 5 engineering hours’ worth, then it’s less compelling.

Thinking of savings in terms of engineering hours helps teams to think of the savings they generate as a potential new engineer on their team. The more engineering hours saved, the more likely they will get additional headcount approved.

We don’t advise making changes without first investigating the impact of those changes. Sometimes teams perform changes—or worse, set up automation to force resource resizing—without understanding their impacts. This can lead to production issues for services.

Before investing time in changes to reduce usage, teams should determine the likelihood that those changes will need to be reverted. If you expect the workload to increase over the coming weeks, and the time it takes to roll out the decrease in size means you would have to increase it almost immediately afterward, it would not be a good investment of time. You must also consider other projects, weighing the benefits you would realize by making the changes. If you roll out a new service only days after resizing instances that the new service replaces, then once again the time could have been better spent elsewhere.

While the savings can appear small, especially in contrast to your whole cloud costs, it’s important to remember that savings compound over time. By removing an unnecessary resource, you prevent being charged for that resource in every month’s bill thereafter.

Serverless Computing

Serverless computing is a model in which the cloud provider runs the server and dynamically manages the allocation of machine resources. Pricing is based on actual activities executed rather than on prepurchased units of capacity.

This removes many of the unused or underutilized issues discussed earlier in this chapter. With serverless, you truly pay only for what you’re actively using. Unused resources aren’t typically easy to leave lying around. Serverless architectures can usually be ready to process requests very quickly, compared to starting new server instances and deploying your software when needed.

The move to serverless isn’t without cost, and it’s by no means a panacea to the wastage problem. There was recently a healthy debate within the FinOps Foundation on the best way to forecast and compare the serverless costs for an application versus its current compute-heavy architecture. Several exceptional methods were suggested, and in the end the discussion showed how much the practice is still evolving.

Ultimately, the complexity of building any migration plan to serverless resides in execution and has very little to do with cost savings. A recommendation to move from a large server instance to many concurrent serverless executions is meaningless, since the cost-associated savings in doing so pale in comparison to the engineering effort to get there. For the majority of cases, there’s little point in worrying about the forecasting or optimizing of serverless, because the real cost is not the cloud bill but the rearchitecting of applications. Put simply: serverless can be cheap, but refactoring for serverless is not. Thus, serverless is often a better option for greenfield projects rather than existing applications.

However, there’s an entirely different lens through which you can evaluate serverless: total cost of ownership (TCO), or the cost of the engineering teams that are required to build a solution, and the impact of time to market on the service’s success and profitability. Remember, serverless allows you to delegate a lot of responsibilities to the cloud provider. Duties that DevOps engineers would typically handle (server management, scaling, provisioning, patching, etc.) become the responsibility of AWS, Google Cloud, or Azure, which leaves dev teams free to focus on shipping differentiating features faster.

Too often—even in this book—the focus is on the cost of the infrastructure itself. But the biggest cost in software development is commonly the people. Consider this closely when looking at both sides of the serverless argument. The people cost (e.g., salaries) may cancel out any infrastructure savings when you’re considering a move from a monolithic application to a serverless one. Coming back to the benefits versus effort, you should consider the overall cost in redesigning services for serverless against the potential for reduced costs.

But when you’re building a new greenfield serverless service, the people cost savings may be well worth it. Remember that serverless can both speed up time to market (by preventing you from rebuilding functionality that is available off the shelf) and dramatically reduce ongoing ops requirements. These benefits allow you to redirect resources to building products instead of maintaining servers.

The entire discussion on serverless—as with all things FinOps—should be grounded in the goal of FinOps: it’s not about saving money; it’s about making money. On your cloud bill, serverless could actually end up being more expensive for certain applications, or it might save a ton of money for others. The real cost you incur to achieve those savings will be the people cost, but the real benefit you gain from serverless may well be shorter time to market. And when it comes to competitive advantage, opportunity costs outweigh many real costs.

Not All Waste Is Waste

If you start yelling at everyone who has resources that appear to be oversized, eventually your teams start yelling back. (Of course, when doing FinOps, you don’t have to yell because there are collaboration, trust, and agreed-upon processes between teams.) Successful FinOps leverages the value of cross-functional teams having conversations with each other. Understanding that there are valid reasons to oversize resources can change the language and tone you use around rightsizing. Many a finance leader has created tensions with their engineering leader by approaching an executive with a list of oversized resources, claiming widespread and rampant waste.

Tip

Fixing inefficiencies won’t always be the most important thing. The prioritization of addressing waste versus everything else needs to be set by engineering leadership. The people making those decisions should be doing so from a place of solid information and understanding of both the cloud environment and the business priorities. Savings opportunities must be communicated in a manner that allows them to be apples-to-apples compared against competing priorities.

Where systemic waste has been consciously decided upon, aim to have a way of filtering those out (ideally with some time bounding so they get revisited later) so they don’t create noise obfuscating the truly addressable opportunities. The process of excluding items from review should also have oversight.

Until you have confirmation from the teams responsible for the resource, you can’t be sure there isn’t a valid reason for overprovisioning. What’s important is that someone spends some time investigating the reasons why the resource is oversized and either makes adjustments to its size or provides context for why the resource is sized as it is. This is why you decentralize usage reduction to the teams responsible for each application. While a centralized FinOps team can help provide recommendations and best practices on rightsizing, the ultimate action needs to fall on the application or service owner.

One valid reason for overprovisioning is hot/warm disaster recovery. A resource is sized to allow production traffic to be moved onto it quickly, meeting the recovery time objectives of the team’s service. Another applies to services that need extra capacity in the event of a failure: during normal day-to-day operation, the service doesn’t need the additional size, but it requires it when a failure occurs.

Even teams that are fully optimized can be surprised. Optimization opportunities often appear via external factors that are not in a team’s control. A price drop, a performance increase, a service release, a change in architecture, or a similar event might trigger the need to optimize or rightsize. Yet the team has little ability to foresee these events or plan for them. So scorekeeping due to external factors may not be entirely fair.

Again, FinOps is about collaboration between teams. Resizing resources without the input of the teams that manage them may result in issues for the services and also for future optimization efforts.

Tip

Unless application health and performance are prioritized over savings, engineering teams will tend to view rightsizing with skepticism. Once teams can trust that rightsizing recommendations put performance first, acceptance will begin to rise.

Your usage optimization workflow should allow for these reasons to be tracked. You can use a ticketing system to enable you to define this workflow—and to record the communications and investigations. A ticketing system also allows you to track actions over time and work out what outstanding tasks you have and their current status.

Where there’s a business case for oversizing a resource, it’s no longer classified as waste. Once the team that owns the resource provides a valid reason for the sizing of a resource, it helps to have a process to mark that investigated resource, removing the savings recommendation from your tracking.

Remember that even just the investigation of savings opportunities that aren’t ultimately actioned or actionable can be important opportunities for the FinOps team to work collaboratively with the product and application teams, to learn more about the application environments, and to establish trust.

Maturing Usage Optimization

Using a gradual, incremental improvement strategy is a recurring theme in FinOps, and it applies to usage optimization as well. You shouldn’t be trying to solve all wasted usage at once. You should instead identify the usage reduction method that is most likely to save your organization money. And in the early stages, that’s usually idle resources.

While doing so, you build out reporting, showing your organization the size of the problem and the potential to save. Then you aim to optimize the top 10 items on the list and review the effectiveness of your engineers’ efforts. Having that feedback loop showing the results of efforts put in by your teams enables trust in the process, and it shows the organization that continued effort is a benefit.

Starting small reduces the risks to your organization, because many of the strategies that enable usage reduction come with changing resources and/or software that is used to deliver services to customers. Decreasing the number of in-process changes at any one time will lower the chance that these changes might greatly affect the business.

We suggest starting with the lowest-impact recommendations first, such as cleaning up unused volumes and IP addresses. Test out rightsizing recommendations in nonproduction environments, and get the broader organization used to the variable usage benefits that cloud provides via its ability to change resource sizing to match needs. This elasticity provides an enormous benefit to your business, but only if your organization can integrate this behavior. Actioning block storage volume rightsizing recommendations requires no downtime and very low risk, another great entry point into the world of rightsizing. Remember FinOps principle 6, which reminds us to take advantage of the variable cost model of the cloud.

Advanced Workflow: Automated Opt-Out Rightsizing

The hardest part of usage optimization is having teams take responsibility and perform the actions needed. Recommendations alone don’t make you more efficient—the workflow and practice of implementing the changes required is where savings are realized.

Automating changes to an environment is an advanced FinOps process, recommended only for mature practices. At a minimum, you should track your recommendations and have your teams manually implement the needed changes.

We talked previously about using tickets to track rightsizing recommendations, but it goes deeper than just monitoring feedback. Using a ticketing system should enable you to identify:

  • The number of investigations into your recommendations

  • How many of the investigations end up with savings versus no action being taken

  • How long, on average, your recommendations are sitting without movement

  • How many of your teams are actively performing the changes you recommend

  • The type of recommendation feedback, including business justification or technical justification for not rightsizing

Drive tickets to teams and use JIRA to your advantage because that’s how engineers do their planning. So if your optimization is in a waste ticket type X assigned to Team Y they have to plan for it or close it with punishment.

Jason Fuller, HERE Technologies

Tickets assigned to a responsible owner always work better. Assigning tickets to team members generally makes them feel more accountable for cost wastage.

One-off automated changes or resizing of resources that are running is typically not a successful pattern. In organizations using infrastructure as code, the right way to fix it is to actually change the code and restack the environment, instead of playing a whack-a-mole game against runtime, which will just revert back to the inefficient configuration upon restacking, in the absence of code updates.

Jason Rhoades, Intuit

Tracking Savings

Unlike rate optimization, there are no details in your cloud bill to show the discount or savings you get doing usage optimizations. Unless you have a process that enables you to directly attribute the change in instance sizes to your usage reduction efforts, the benefits of all that hard work will be difficult to track.

Usage optimization is also known as cost avoidance, because the only sign of savings in your bill is the lack of charges. Looking at the bill, you might notice a reduction in usage and be tempted to sum this up as savings made by your usage optimization efforts. However, just because you recommended that a team look into resizing a resource doesn’t mean that they made the changes based on that recommendation. Alternatively, you might see resource usage increasing even though teams may be implementing your recommended rightsizing changes. And as if that weren’t enough, any reduction in usage could be hidden by new resource usage for other projects. So trying to show the individual savings made across your cloud resources becomes very difficult.

My FinOps teams used “mini business cases” to document all the opportunities they identified for savings. These could be some sort of ticket ideally, or small document, or even a row on a shared spreadsheet. We document each opportunity, whether that’s a rightsizing option, a stranded resource, a VM modernization, a license conversion, etc. We document them and track them over time to create a ledger of what opportunities we’ve recommended, and then ideally track which were actioned. It’s a key way for a FinOps team to track and quantify their activities and justify cost avoidance claims throughout the year.

Rob Martin, Director of Learning, FinOps Foundation

Some recommendations lend themselves easily to tracking their cost impact. For example, AWS r5 instances cost 45% less than r3 instances. Every day that you run an r3 and don’t pull this trigger, you’re wasting money. However, there are also important technical reasons that might preclude moving to r5 instances (the compatibility of Amazon EMR [Elastic MapReduce] versions we’re running, for example), so in addition to the opportunity value, you need to know the cost of implementing it and the context of technical barriers to implementation.

Recommended changes might also have opportunity costs—requiring important team resources who are working on other projects, for example, or delaying the migration of other systems to cloud. The results of these could make the change fiscally inadvisable.

Teams can be defensive when given optimization recommendations from outside the team. We encourage FinOps teams to use these as opportunities to start conversations. You should ask questions about justifications and service use rather than presenting them as business cases. A formal business case seems like it would provide clear and constructive direction, but it can put a lot of pressure on a team to justify their past actions and selections. It might feel like you’re attacking the premise or content of the business case rather than the opportunity. These are the kinds of experiences that cause teams to shy away from working with the FinOps team in the future.

For FinOps teams who meet with their teams (or top spenders) regularly, it’s effective to discuss recommended optimizations in each meeting. This way, they can be raised as potential opportunities in one meeting, teams can investigate which are possible/advisable, and then they can work with the FinOps team during the month to build the mini–business case and schedule activity. In subsequent meetings, they can report on progress and can report cost avoidance for their team. This puts the savings (or the reason it’s not being done) on the record in the FinOps ledger and meeting notes. It also provides information for the central FinOps team to use in evaluating commitments and discounting activities.

An alternative way to track usage optimization efforts is to look at the recommendations being made. If you sum up the total potential savings you could make by implementing all your recommendations today and then compare that figure to the amount you could have saved at some point in the past, you can determine whether your organization is getting more optimized or less so. Taking this total potential savings figure, and working out how much of a percentage compared to the total spend you have on the same resource type, allows you to determine a potential savings percentage.

For example, say you have $10,000 of total potential savings recommendations for server instances. If you currently spend $100,000 on server instances, you’re operating at a 10% potential savings. If you divide the savings recommendations and total spend (using the tags and account allocation strategies discussed in Chapter 12), you can compare one team to another.
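The calculation itself is simple; as a sketch, using the figures from the text:

```python
def potential_savings_pct(savings_recs: float, total_spend: float) -> float:
    """Potential savings as a percentage of spend on the same resource type."""
    return 100 * savings_recs / total_spend

# $10,000 of recommendations against $100,000 of server instance spend.
print(potential_savings_pct(10_000, 100_000))  # 10.0
```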

Tip

To calculate potential optimization savings, consider what the life span of the positive effect is. For instance, if you terminate an orphaned object now that saves $1,000/month, how long do you keep counting that $1,000? For the rest of that month, quarter, or year? One answer might be to consider “how long would this thing have kept going were it not for the FinOps waste program” and to consider the delta in time between when it got fixed and this length of time. These timeframes can be different across different types of waste and across teams, so there is no hard and fast rule to apply.
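One possible way to encode that bounded counting is sketched below; the dates, amounts, and the assumed end date are all hypothetical:

```python
from datetime import date

def cost_avoided(monthly_saving: float, fixed: date, assumed_end: date) -> float:
    """Count savings only between the fix date and the date the waste
    would plausibly have ended anyway (e.g., a planned decommission)."""
    months = (assumed_end.year - fixed.year) * 12 + (assumed_end.month - fixed.month)
    return monthly_saving * max(months, 0)

# A $1,000/month orphan fixed in March that would otherwise have run until
# December counts as $9,000 of avoidance, not an open-ended running total.
print(cost_avoided(1_000, date(2024, 3, 1), date(2024, 12, 1)))  # 9000.0
```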

However you decide to calculate savings and track them over time, do it consistently, document your assumptions, and have it ready. Because one thing most FinOps practitioners report is that sooner or later the leadership in charge of the FinOps team will stop by to ask how much the FinOps team has saved by optimizing, and they usually want an answer right away.

Effective FinOps practitioners focus on the teams with the highest potential savings percentages. They assist those teams in understanding the recommendations, explain how to opt out of the recommendations when doing so makes sense, and provide expertise on the impacts of resizing resources.

Gamifying the work of optimizing, particularly for teams that are more in steady-state or maintenance modes, can be a good driver of behavior. Depending on the company’s culture, it might be possible to use a worst offenders list, which uses the idea of being crowned the most wasteful to pressure teams to look more seriously into the recommendations. But negative metrics of wastefulness, or ones that call out offenders, might not be taken well. The (Total Cost − Savings Opportunities) / Total Cost formula is a percent-optimized metric that can be stated positively. If you build 100% optimized, your score is 100. If you don’t, it’s lower. You can track cumulative unoptimized cost over time, but you want to encourage both optimization work and optimized architecture.

One of the best ways to influence positive behavior is to give each organization a savings accumulation graph that totals up the savings achieved from taking recommendations over time. Because savings can compound, teams will quickly see that the more they take rightsizing seriously, the more their savings graph will compound, giving them an impressive hockey-stick-like growth trajectory month over month. By taking 10 recommendations per month, by the end of six months, they will have the accrued benefit of 60 rightsizing recommendations. While the first few months might look like a linear growth curve, the savings numbers begin to look impressively large as they multiply over time.
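A quick sketch of why the curve bends upward (the per-recommendation saving is a hypothetical figure): each accepted recommendation keeps saving in every following month, so cumulative savings grow quadratically even though adoption is linear.

```python
saving_per_rec = 100      # hypothetical $/month saved per recommendation
recs_per_month = 10
cumulative, active = 0, 0
for month in range(1, 7):
    active += recs_per_month            # recommendations actioned so far
    cumulative += active * saving_per_rec
    print(f"month {month}: {active} recs live, ${cumulative:,} saved to date")
```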

Conclusion

Usage optimization is often more difficult than rate optimization, and it usually involves multiple teams working together to have the right recommendations, investigations, and changes implemented by the correct teams. The savings that can be realized by usage optimization can be significant.

To summarize:

  • Usage optimization is about using only the resources needed for a workload and only when that workload needs to be running.

  • Visibility into waste can lead to a leaner culture inside an organization, as teams become more aware of the impacts of waste.

  • Use high-quality rightsizing recommendations to avoid stalling out any rightsizing initiatives or causing unnecessary engineering pushback.

  • Bring optimization options to engineering teams with discussion and questions rather than mandates, to allow correct decisions to be made with all the facts.

  • Formal workflows around usage optimization, especially when combined with automation, can lead to the best outcomes.

  • Usage optimization is the hardest way to reduce cloud costs, and it should be implemented carefully using an incrementally improving approach.

  • Track optimization savings closely to show the impact of the FinOps work, and work collaboratively with teams on a regular cadence to investigate opportunities, remembering that not all of them can be acted on.

Usage optimization can cause rate optimization issues when usage you’ve committed to is resized or removed based on the recommendations. It’s crucial to take usage optimizations into account when performing rate optimizations to avoid making commitments on usage that will change.

Now that we have covered optimizing usage, we’ll proceed to rate optimization, where you’ll save further by reducing the rate you pay for cloud resources.
