Taming AWS Costs

Considering Cloud Costs

Throughout my tenure as an engineering leader, I have overseen cloud budgets that extend well into the millions of dollars.¹ These budgets typically encompass multiple vendors, with the compute platform, almost always AWS, occupying a significant portion.

With my background and focus on AWS, I have developed a playbook for optimizing costs that has proven effective in keeping expenses in check.

Before diving in, it's worth considering when you should prioritize costs. At a high level, most early- to growth-stage companies focus less on cost optimization, while companies approaching later stages of growth look to tie costs to metrics like Cost of Goods Sold.²

Cost Explorer

In order to make headway on cost reduction, we first need to understand where the money goes today.

Cost Explorer is a powerful tool to help you understand and manage your AWS costs. It enables you to analyze your usage and pinpoint where your money is being spent. With it you can view both historical data and service-level spend in a more granular manner.

Analysis is organized into reports which can be saved and referenced later. I would recommend setting up reports which capture your high-level costs as well as more granular reports (for example, of your most expensive resources) which show deeper analysis of specific resources.
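As a rough sketch of the high-level report described above, the same data can be pulled through the Cost Explorer API. The example assumes Python with boto3 and credentials allowed to call `ce:GetCostAndUsage`. The helper names and the sample response are illustrative, not taken from a real account, and pagination is not handled.

```python
from collections import defaultdict
from datetime import date


def build_request(start: date, end: date) -> dict:
    """Build GetCostAndUsage parameters: monthly unblended cost per service."""
    return {
        # The End date is exclusive in the Cost Explorer API.
        "TimePeriod": {"Start": start.isoformat(), "End": end.isoformat()},
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [{"Type": "DIMENSION", "Key": "SERVICE"}],
    }


def total_by_service(response: dict) -> dict:
    """Sum each service's cost across every period in the response."""
    totals = defaultdict(float)
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            service = group["Keys"][0]
            totals[service] += float(group["Metrics"]["UnblendedCost"]["Amount"])
    return dict(totals)


def fetch_costs(start: date, end: date) -> dict:
    """Call AWS. Needs boto3 and credentials; not exercised in this sketch."""
    import boto3  # imported lazily so the helpers above work without it

    client = boto3.client("ce")
    return client.get_cost_and_usage(**build_request(start, end))


EC2 = "Amazon Elastic Compute Cloud - Compute"
S3 = "Amazon Simple Storage Service"

# Made-up data in the shape GetCostAndUsage returns.
SAMPLE_RESPONSE = {
    "ResultsByTime": [
        {
            "TimePeriod": {"Start": "2024-01-01", "End": "2024-02-01"},
            "Groups": [
                {"Keys": [EC2], "Metrics": {"UnblendedCost": {"Amount": "1200.50", "Unit": "USD"}}},
                {"Keys": [S3], "Metrics": {"UnblendedCost": {"Amount": "300.00", "Unit": "USD"}}},
            ],
        },
        {
            "TimePeriod": {"Start": "2024-02-01", "End": "2024-03-01"},
            "Groups": [
                {"Keys": [EC2], "Metrics": {"UnblendedCost": {"Amount": "1300.25", "Unit": "USD"}}},
                {"Keys": [S3], "Metrics": {"UnblendedCost": {"Amount": "310.00", "Unit": "USD"}}},
            ],
        },
    ]
}

totals = total_by_service(SAMPLE_RESPONSE)
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
for service, cost in ranked:
    print(f"{service}: ${cost:,.2f}")
```

Keeping the parsing separate from the API call makes it easy to run the same summary over saved responses.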

Resource Tagging

To further enhance the functionality of Cost Explorer, you should use resource tagging. Tagging allows you to create custom reporting based on key-value metadata associated with resources. With tags, you can help drive cost accountability by assigning standardized tags, such as the owning team, to all resources. Doing so makes it possible to better share cost accountability across the wider function.

Note: It's important to underscore that the value of tags is directly correlated with the effort you put into designing them. It's difficult to give prescriptive advice, since this will largely depend on your specific needs, but there are some common best practices for establishing a good tagging strategy.

As an example, consider a tag anatomy that's consistent and captures specific qualities about the resource, such as the department or team that owns it, the service or project it belongs to, and so on.
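To make the idea concrete, here is a minimal sketch of a check that flags resources missing required tags. The tag keys (`team`, `service`, `environment`) and the resource IDs are hypothetical; substitute whatever anatomy fits your organization.

```python
# Hypothetical tagging scheme: every resource must carry these keys.
REQUIRED_TAGS = {"team", "service", "environment"}


def missing_tags(tags: dict) -> set:
    """Return the required keys that are absent or blank on a resource."""
    return {key for key in REQUIRED_TAGS if not tags.get(key, "").strip()}


# Made-up inventory: resource ID mapped to its tags.
resources = {
    "i-0abc": {"team": "payments", "service": "checkout-api", "environment": "prod"},
    "vol-0def": {"team": "payments"},
    "db-reports": {},
}

untagged = {
    resource_id: sorted(missing_tags(tags))
    for resource_id, tags in resources.items()
    if missing_tags(tags)
}

for resource_id, missing in untagged.items():
    print(f"{resource_id} is missing: {', '.join(missing)}")
```

A check like this could run on a schedule so that untagged spend is caught before it shows up as an unattributed line in a report.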

Optimization Levers

Cost optimization strategies encompass a variety of options and tools, such as aligning resource allocation with actual usage and purchasing compute through AWS programs that offer better pricing in exchange for a commitment over a specific duration.

Here are some of the approaches I've used:

Right Sizing

Right sizing is the process of aligning your infrastructure with your real-world needs. It involves auditing your resources, analyzing their usage, and projecting future growth. It's not uncommon to uncover low-hanging fruit during such an audit, such as over-provisioned hardware or resources that are entirely unused.

Auditing real-world usage is also an effective way to gain a broader understanding of your infrastructure and its relationship with cost. It enables you to document each service and its corresponding infrastructure, providing valuable insights into their relative weight within your budget.³
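A minimal sketch of what a first pass of such an audit might look like, assuming you have already exported CPU utilization samples per instance. The thresholds, field names, and data are hypothetical, and a real audit would also consider memory, network, and disk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InstanceUsage:
    instance_id: str
    monthly_cost: float       # dollars
    cpu_samples: List[float]  # percent utilization over the audit window


def classify(usage: InstanceUsage, idle_below: float = 5.0,
             oversized_below: float = 40.0) -> str:
    """Label an instance by its peak CPU over the audit window."""
    if not usage.cpu_samples:
        return "no-data"
    peak = max(usage.cpu_samples)
    if peak < idle_below:
        return "unused"
    if peak < oversized_below:
        return "over-provisioned"
    return "ok"


# Made-up fleet for illustration.
fleet = [
    InstanceUsage("i-web-1", 280.0, [35.0, 62.0, 71.0]),
    InstanceUsage("i-batch-1", 560.0, [8.0, 12.0, 21.0]),
    InstanceUsage("i-old-demo", 140.0, [0.4, 1.1, 0.9]),
]

findings = {instance.instance_id: classify(instance) for instance in fleet}
at_risk_spend = sum(
    instance.monthly_cost for instance in fleet
    if findings[instance.instance_id] != "ok"
)

for instance_id, label in findings.items():
    print(f"{instance_id}: {label}")
print(f"monthly spend worth reviewing: ${at_risk_spend:,.2f}")
```

Using peak rather than average utilization is a deliberately conservative choice here: an instance that is busy only briefly still needs the capacity for that window.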

Graviton

Without getting into the weeds of CPU architecture, the primary thing to understand about Graviton (an ARM-based CPU designed by AWS) is that it's more efficient, which means it generally costs you less for the same or better performance.⁴

What this means is that compute instances you currently use that leverage x86 CPU architecture can potentially be replaced with their Graviton counterparts. This can be a quick win and provide immediate cost savings without trading off performance.

Reserved Instances

Most of us are familiar with AWS's on-demand pricing: this is the retail price quoted for a resource, usually billed based on usage. If you set up a new AWS account and spin up an EC2 instance, this is likely how you're being billed.

However, AWS offers the ability to purchase reserved instances of some compute resources. This is effectively a contract which says you agree to pay AWS a certain rate for a certain period of time to reserve the resource. Generally speaking, these offer significant discounts: up to 72% compared to on-demand.

The catch is you agree to a contract term, usually of 1 or 3 years. So for this to be an effective strategy, you need to have some idea of what your resource needs will look like over the contract term.
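The trade-off can be sketched with some simple arithmetic. The model below assumes a reservation is paid for every hour of the term whether or not it is used, while on-demand is paid only for hours actually run. The hourly rate and discount are made-up numbers, not AWS prices.

```python
HOURS_PER_YEAR = 8760


def break_even_utilization(discount: float) -> float:
    """Fraction of the term an instance must run for a reservation to pay off.

    A reservation costs (1 - discount) of the on-demand rate for every hour,
    so it wins once on-demand hours used exceed that same fraction.
    """
    return 1.0 - discount


def annual_costs(on_demand_hourly: float, discount: float, utilization: float):
    """Return (on_demand_cost, reserved_cost) for one instance over a year."""
    on_demand = on_demand_hourly * HOURS_PER_YEAR * utilization
    reserved = on_demand_hourly * (1.0 - discount) * HOURS_PER_YEAR
    return on_demand, reserved


# Hypothetical instance: $0.10/hour on-demand with a 40% reservation discount.
for utilization in (0.50, 0.80, 1.00):
    on_demand, reserved = annual_costs(0.10, 0.40, utilization)
    cheaper = "reserved" if reserved < on_demand else "on-demand"
    print(f"{utilization:.0%} utilized: on-demand ${on_demand:,.2f}, "
          f"reserved ${reserved:,.2f} -> {cheaper}")
```

In this model a 40% discount only pays off if the instance runs more than 60% of the time, which is why a forecast of your resource needs matters before committing.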

Spot Instances

Reserved instances are discounted in part because of your upfront commitment to AWS, whereas spot instances are discounted based on your flexibility to cycle through compute resources as you operate your application.⁵ Said another way, if your application can tolerate losing its compute resources temporarily, AWS will give you a steep discount: up to 90% compared to on-demand.

The key thing to understand here is that the application itself needs to be architected so that its compute resources can be quickly replaced while it's operating. For some applications this is trivial, and for others it's a tall order.

Consider adopting spot instances where your applications are already a good fit for this, such as with stateless services.
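As a minimal sketch of what "tolerating replacement" can mean in practice, the worker below stops picking up new work as soon as an interruption is signalled and hands the rest back. The `interruption_pending` callable is a stand-in: on a real spot instance it would typically poll the instance metadata service for a pending interruption notice, which is an assumption this example does not implement.

```python
from typing import Callable, Iterable, List, Tuple


def run_until_interrupted(
    jobs: Iterable[str],
    process: Callable[[str], None],
    interruption_pending: Callable[[], bool],
) -> Tuple[List[str], List[str]]:
    """Process jobs one at a time, stopping as soon as a notice is seen.

    Returns (completed, remaining) so the remaining jobs can be handed back
    to a queue and picked up by a replacement instance.
    """
    remaining = list(jobs)
    completed: List[str] = []
    while remaining and not interruption_pending():
        job = remaining.pop(0)
        process(job)
        completed.append(job)
    return completed, remaining


# Stand-in for a real check: report an interruption on the third poll.
polls = iter([False, False, True])
completed, remaining = run_until_interrupted(
    jobs=["job-1", "job-2", "job-3", "job-4"],
    process=lambda job: None,  # real work would go here
    interruption_pending=lambda: next(polls),
)
print("completed:", completed)
print("hand back:", remaining)
```

The important property is that no state lives only on the instance: unfinished work is returned rather than lost, which is what makes stateless services a natural fit.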

Savings Plans

AWS offers savings plans, which include Compute Savings Plans and EC2 Instance Savings Plans. Like reserved instances, these savings plans are a commitment to pay some rate to AWS for some term (1 or 3 years), and in exchange AWS gives you a discount on EC2, Lambda, and Fargate.

Compute Savings Plans: This is the most flexible option. It is applied to all eligible instances regardless of family type or region and is applicable to Lambda and Fargate. These plans provide strong discounts: up to 66% compared to on-demand.
EC2 Instance Savings Plans: This plan is less flexible: it requires specifying the eligible instance family in a given region and is not applicable to Lambda or Fargate. However, these plans provide the strongest discount: up to 72% compared to on-demand.

It's important to note that these savings are applied after your reserved instances. Likewise, EC2 instance savings plans are applied before the more flexible compute savings plans. What this means is that discounts are applied in the most advantageous way but aren't stacked.
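The ordering can be illustrated with a deliberately simplified model. It assumes all usage is expressed in on-demand dollars per hour, that each commitment carries a single flat discount rate, and that a commitment is paid in full whether or not it is used. Real plans apply per-instance rates, so treat this as a sketch of the order of application rather than of AWS billing. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Commitment:
    hourly_commitment: float  # dollars per hour paid regardless of usage
    discount: float           # fraction off on-demand, e.g. 0.5 for 50%

    def cover(self, usage: float) -> float:
        """On-demand-equivalent usage (dollars/hour) this commitment absorbs."""
        capacity = self.hourly_commitment / (1.0 - self.discount)
        return min(usage, capacity)


def hourly_bill(ec2_usage: float, other_usage: float, reserved: Commitment,
                ec2_plan: Commitment, compute_plan: Commitment) -> float:
    """Apply each discount once, in order, to whatever usage is still uncovered."""
    # 1. Reserved instances are applied first, to EC2 usage only.
    ec2_usage -= reserved.cover(ec2_usage)
    # 2. EC2 Instance Savings Plans come next, also EC2 only.
    ec2_usage -= ec2_plan.cover(ec2_usage)
    # 3. Compute Savings Plans cover what is left across EC2, Fargate, and Lambda.
    leftover = ec2_usage + other_usage
    leftover -= compute_plan.cover(leftover)
    # Commitments are paid in full; anything still uncovered is on-demand.
    committed = (reserved.hourly_commitment + ec2_plan.hourly_commitment
                 + compute_plan.hourly_commitment)
    return committed + leftover


# Hypothetical hour: $10 of EC2 and $2 of Fargate/Lambda at on-demand rates.
bill = hourly_bill(
    ec2_usage=10.0,
    other_usage=2.0,
    reserved=Commitment(hourly_commitment=2.4, discount=0.40),
    ec2_plan=Commitment(hourly_commitment=2.0, discount=0.50),
    compute_plan=Commitment(hourly_commitment=1.5, discount=0.40),
)
print(f"on-demand: $12.00/hour, with commitments: ${bill:.2f}/hour")
```

Each dollar of usage is discounted by exactly one mechanism, which is the "not stacked" behavior described above.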

Note: The percentage discounts offered by savings plans and reserved instances are essentially identical, but the mechanisms are different and suit different needs. It's important to map this back to your specific business needs in order to decide which fits best.⁶

Enterprise Discount Program

When you reach a certain threshold of spend, an Enterprise Discount Program (EDP) may become an option available to you.⁷

EDPs are discount programs tailored to your business needs. Their goal is to help you achieve better economies of scale with AWS; in exchange, you commit to specific, and increasing, levels of spend with AWS. These are often multi-year contracts and are only appropriate for businesses looking to further their investment in AWS. In return, you receive better pricing than you could otherwise get, as well as several other bells and whistles.

It's important to note that while an EDP can be a valuable investment, it's unlikely to be the point from which you should start cost optimization.⁸

Safeguards

Runaway costs are a terrifying possibility of cloud services like AWS: it's far easier than we might hope to spend a lot of money quickly without realizing or intending. To help avoid this, AWS provides cost management services to help ensure you're staying within a budget and that unexpected costs are identified and caught quickly. These include budgets and cost anomaly detection.

Budgets can be established for individual teams and projects. In my experience, it's best to be more granular with these when possible. This can be part of a larger cost accountability effort which empowers smaller teams to own their end-to-end costs.

Similarly, account for growth: if you establish alerts but ignore them because they trigger with expected growth, then you risk missing important notifications later on when costs are growing in unexpected ways. The key here is to plan according to your expectations and to set your alerting thresholds accordingly.

With these tools you can set guidelines for expected spend and get alerted when things aren't going as expected.
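One way to plan for growth, sketched below, is to derive each month's budget from a baseline and an expected growth rate, then alert only when spend exceeds that projection by some margin. The baseline, growth rate, and tolerance are hypothetical numbers, and the functions are illustrative rather than part of any AWS tool.

```python
from typing import List


def monthly_budgets(baseline: float, monthly_growth: float, months: int) -> List[float]:
    """Project a budget for each upcoming month from compounding growth."""
    return [
        round(baseline * (1.0 + monthly_growth) ** month, 2)
        for month in range(1, months + 1)
    ]


def should_alert(actual: float, budget: float, tolerance: float = 0.10) -> bool:
    """Alert only when spend exceeds the planned budget by more than the tolerance."""
    return actual > budget * (1.0 + tolerance)


# Hypothetical team: $100,000/month today, expected to grow 5% per month.
budgets = monthly_budgets(baseline=100_000, monthly_growth=0.05, months=3)
actuals = [112_000, 111_000, 140_000]  # made-up spend for the same months

for month, (budget, actual) in enumerate(zip(budgets, actuals), start=1):
    status = "ALERT" if should_alert(actual, budget) else "ok"
    print(f"month {month}: budget ${budget:,.2f}, actual ${actual:,.2f} -> {status}")
```

Because the threshold moves with the plan, an alert means spend has departed from expectations rather than simply grown.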

It's worth noting this isn't a perfect solution: there can be delays in reporting usage which means it's still possible to exceed a budget or discover an unexpected charge after the fact.⁹ As such, it's important to ensure that your process for managing infrastructure also supports predictable usage and helps avoid orphaned resources, account takeover, and other vectors for unbounded costs.

Getting the Most Out of Your Dollar

Optimizing AWS costs can be a complex process, and achieving the best results will depend on your business needs and the structure of your technical product. Nevertheless, the techniques we've explored can serve as starting points to help you stretch your engineering budget even further. With careful planning, you can get the most out of your AWS investment and build more efficient and cost-effective infrastructure.

Do you have thoughts or something you'd like to add? Please shoot me an email.

Footnotes

¹ In one case, my teams captured over 30% in savings on multiple millions of dollars in annual costs. We did it using this playbook.
² Recent macroeconomic changes might shift this priority forward. If the goal is "default alive", then your engineering budget and cloud costs may be a consideration.
³ This is generally a good practice that should be conducted regularly and can eventually be automated.
⁴ As AWS claims, "AWS Graviton processors are custom-built by AWS to deliver the best price performance for cloud workloads."
⁵ This is how AWS takes advantage of idle capacity: they resell it but with important constraints in order to maintain the liquidity of their compute market.
⁶ My teams have used both reserved instances and savings plans together. One strategy is to under-provision your savings plans and then leverage reserved instances for targeted compute needs. It's worth doing the math to determine the best approach.
⁷ Usually this is over $1,000,000 annually.
⁸ Moreover, EDPs aren't available to all AWS customers. See the spend threshold, for example.
⁹ While perhaps not an official capability, folks have historically had some success appealing billing mishaps to AWS support. Do not rely on this, and take precautions from the outset to avoid needing it.
