Spend less money on your cloud infrastructure



Here are ten ways to make your infrastructure more cost-efficient.

Before cloud computing, developers and product owners had to order or request a VM from either in-house IT or an MSP. This could be a very tedious process with a lot of gatekeepers (which was natural in a silo-structured organization). Removing a machine was almost as cumbersome. With the introduction of the cloud, developers not only got quick and easy access to create and manage their own machines, they could also scale very rapidly.

Cloud computing has revolutionized the way organizations operate by providing scalable and flexible IT infrastructure. However, as organizations move more of their workloads to the cloud, they are facing challenges related to cost management. Cloud services can be expensive, especially for organizations that have variable or unpredictable workloads.

Workloads were often moved to the cloud using what is called a “lift and shift” approach, which the cloud providers also recommended. This approach was of course quite beneficial for the cloud providers: while you were busy transforming your servers and applications, you still had to pay for the servers now running in the cloud. The transformation project was often delayed or never fully implemented, which meant that a lot of machines never got properly sized. For example, a machine specified with 4 vCPU and 32GB RAM on old AMD hardware in the datacenter might get by with 2 vCPU and 16GB RAM on more modern hardware in the cloud. Another problem that emerged in the early days was telling used workloads apart from unused ones.

In this post, we will go through several tips that can help organizations reduce their cloud computing costs. By following these, you can optimize your cloud infrastructure and ensure that you are paying only for the resources you need.

1. Tagging (labels in GCP)

Tagging all your resources helps you identify whether they are being used, what they are used for, and which team is behind them. Some tags (with examples) that we have found useful are:

  • Environment: Prod, Stage or Dev (It’s always good to know if you need to be extra careful handling a certain resource)
  • Team: Economy, Frontend, Gophers (Knowing what team to contact when in doubt if a resource is used or not)
  • Service: budget-aggregation-service, alignment-service, auth-service (A service can use multiple resources in the cloud: S3 buckets, Lambdas etc.)
  • Repository: github-repo-2 (This one is really useful if Terraform is being used, as it helps you locate from where a resource is being managed)
  • End-of-Life: 2024-04-15 (This is especially useful when you have a time scoped project)

Not all resources allow tagging. In those cases you can fall back to naming conventions like prod-alignment-service-bucket.
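
As a minimal sketch of how the tags above could be audited, the following Python flags resources that lack required tags. The `REQUIRED_TAGS` set and the inventory dictionaries are hypothetical; in practice the inventory would come from your provider's API or from Terraform state.

```python
# Hypothetical set of required tag keys, taken from the list above.
REQUIRED_TAGS = {"Environment", "Team", "Service", "Repository"}


def missing_tags(resource: dict) -> set:
    """Return the required tag keys that are absent from a resource."""
    return REQUIRED_TAGS - set(resource.get("tags", {}))


def audit(resources: list) -> dict:
    """Map resource name -> sorted missing tags, for non-compliant resources only."""
    report = {}
    for resource in resources:
        missing = missing_tags(resource)
        if missing:
            report[resource["name"]] = sorted(missing)
    return report


# Made-up inventory, for illustration only.
inventory = [
    {"name": "prod-alignment-service-bucket",
     "tags": {"Environment": "Prod", "Team": "Gophers",
              "Service": "alignment-service", "Repository": "github-repo-2"}},
    {"name": "mystery-vm", "tags": {"Environment": "Dev"}},
]

print(audit(inventory))
# {'mystery-vm': ['Repository', 'Service', 'Team']}
```

A report like this gives you a list of resources to chase down, and the Team tag (when present) tells you whom to ask.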

2. Use On-Demand Instances

On-demand instances are a great option for organizations that need the flexibility to scale their resources up and down as needed. Instead of reserving resources ahead of time, organizations can launch instances when they need them and pay only for what they use. This can help organizations avoid over-provisioning and reduce costs associated with idle resources.

3. Utilize Spot Instances

Spot instances are a cost-effective way to run your applications in the cloud. They are available at a significant discount compared to on-demand instances, but the provider can reclaim them at short notice, for example when it needs the capacity back or when the spot price rises above the maximum you are willing to pay. Organizations can use spot instances for batch jobs, big data processing, and other tasks that can be interrupted without a significant impact.

4. Use Reserved Instances

Reserved instances are another cost-saving option for organizations that have a stable and predictable workload. With reserved instances, organizations can reserve resources for a one- or three-year term and pay a lower rate compared to on-demand instances. This is a great option for organizations that have a consistent workload and do not need the flexibility to scale their resources up and down as needed.
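
To make the trade-off between on-demand and reserved pricing concrete, here is a rough break-even sketch. The hourly prices are made up, and the model assumes the reservation can be expressed as an effective hourly rate that is billed for every hour of the term, whether or not the instance runs; real pricing models (upfront payments, convertible reservations and so on) are more involved.

```python
def break_even_utilization(on_demand_hourly: float, reserved_hourly: float) -> float:
    """Fraction of the term an instance must run before reserving is cheaper.

    On-demand is billed only for hours used, a reservation for every hour
    of the term, so the break-even point is the ratio of the two rates.
    """
    if on_demand_hourly <= 0:
        raise ValueError("on_demand_hourly must be positive")
    return reserved_hourly / on_demand_hourly


def cheaper_option(on_demand_hourly: float, reserved_hourly: float,
                   expected_utilization: float) -> str:
    """Pick the cheaper pricing model for an expected utilization (0.0 to 1.0)."""
    threshold = break_even_utilization(on_demand_hourly, reserved_hourly)
    return "reserved" if expected_utilization > threshold else "on-demand"


# Hypothetical prices: 0.10 per hour on demand, 0.06 per hour effective reserved rate.
print(cheaper_option(0.10, 0.06, expected_utilization=0.9))  # reserved
print(cheaper_option(0.10, 0.06, expected_utilization=0.3))  # on-demand
```

With these example numbers, a machine that runs less than roughly 60% of the time is better left on demand.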

5. Right-Sizing of Instances

It is important to choose the right instance type for your workloads to ensure that you are neither over-provisioning nor under-provisioning resources. Over-provisioning leads to idle resources and increased costs, while under-provisioning can result in performance degradation. Organizations can use tools like the AWS EC2 instance selector, GCP Recommender or Azure Advisor to find the right instance type for their workloads.
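
The recommender tools do far more than this, but the basic idea can be sketched with a simple rule of thumb: halving an instance roughly doubles its utilization, so only halve when the observed peaks leave enough headroom. The `Usage` type and the 40% threshold are assumptions made for illustration, not values from any provider.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    vcpus: int
    memory_gb: int
    peak_cpu_pct: float     # highest observed CPU utilization, 0-100
    peak_memory_pct: float  # highest observed memory utilization, 0-100


def suggest_size(usage: Usage, headroom_pct: float = 40.0) -> tuple:
    """Suggest (vcpus, memory_gb), halving the instance only if both peaks allow it."""
    can_halve = (
        usage.vcpus >= 2
        and usage.memory_gb >= 2
        and usage.peak_cpu_pct < headroom_pct
        and usage.peak_memory_pct < headroom_pct
    )
    if can_halve:
        return usage.vcpus // 2, usage.memory_gb // 2
    return usage.vcpus, usage.memory_gb


# The 4 vCPU / 32GB machine from the introduction, with made-up peak figures.
print(suggest_size(Usage(vcpus=4, memory_gb=32, peak_cpu_pct=30.0, peak_memory_pct=35.0)))
# (2, 16)
```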

6. Use Autoscaling (Horizontal)

Autoscaling is a feature that allows organizations to automatically scale their resources up and down based on current system load. With autoscaling, organizations can ensure that they have the resources they need when they need them and avoid over-provisioning or under-provisioning. Autoscaling can help organizations reduce costs by avoiding the need to reserve resources ahead of time and paying only for the resources they use.
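
How an autoscaler arrives at a replica count differs between platforms. One widely used proportional rule, which is the one the Kubernetes Horizontal Pod Autoscaler documents, is desired = ceil(current * currentMetric / targetMetric). The sketch below applies that rule with minimum and maximum bounds; the function name and default bounds are made up for illustration.

```python
import math


def desired_replicas(current: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Proportional scaling: grow or shrink so the metric moves toward its target."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current * current_metric / target_metric)
    # Clamp so a metric spike cannot scale past the cost ceiling,
    # and a quiet period cannot scale below the availability floor.
    return max(min_replicas, min(max_replicas, raw))


# 4 instances at 90% average CPU with a 60% target: scale out.
print(desired_replicas(4, current_metric=90.0, target_metric=60.0))  # 6
# 4 instances at 15% average CPU with a 60% target: scale in.
print(desired_replicas(4, current_metric=15.0, target_metric=60.0))  # 1
```

From a cost perspective the `max_replicas` bound matters most: it caps what a traffic spike or a runaway metric can cost you.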

7. Monitor and Optimize Resource Usage

Organizations should monitor their cloud infrastructure to identify and eliminate any unused or underutilized resources. Additionally, organizations should periodically review their instance types and adjust them as needed to ensure that they are not over-provisioning or under-provisioning.

8. Budgets

Most cloud providers today offer good dashboards that show and break down your current spending as well as your projected spending. You can also create custom budgets with alerts and warnings. This allows you to catch a misconfigured database or similar before the invoice lands in your inbox.
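
The providers' forecasts use their own models, but the idea behind a budget alert can be sketched with a naive linear projection: assume the rest of the month costs the same per day as the month so far. The function names, thresholds and amounts below are hypothetical.

```python
import calendar
from datetime import date


def projected_month_end_spend(spend_so_far: float, today: date) -> float:
    """Extrapolate month-to-date spend linearly to the end of the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_so_far / today.day * days_in_month


def budget_alerts(spend_so_far: float, budget: float, today: date,
                  thresholds=(0.5, 0.8, 1.0)) -> list:
    """Return the budget fractions that the projected spend has reached."""
    projected = projected_month_end_spend(spend_so_far, today)
    return [t for t in thresholds if projected >= t * budget]


# 400 spent after 10 days of a 30-day month projects to 1200 for the month.
print(projected_month_end_spend(400.0, date(2024, 4, 10)))       # 1200.0
print(budget_alerts(400.0, budget=1000.0, today=date(2024, 4, 10)))  # [0.5, 0.8, 1.0]
```

Alerting on the projection rather than on the actual spend is what gives you time to react before the budget is gone.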

9. Cloud Policies

Global policies can be set up to enforce that certain tags are present before a resource is allowed to be created. The drawback is that you do not always get the best error messages from the APIs when your resource creation requests fail.

Policies are also useful for limiting certain instance types in certain environments. Dev, for example, might not be allowed to spin up costly GPU instances such as the Azure NCsv3 series.
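
Real policies are written in the provider's own policy language, so the following is only a sketch of the logic such a policy encodes, not any provider's API. The request shape, the required tags and the deny list are assumptions for illustration. Returning explicit violation messages is one way to soften the poor-error-message drawback, for example by running the same check in CI before the request ever reaches the cloud API.

```python
# Hypothetical policy inputs.
REQUIRED_TAGS = {"Environment", "Team"}
# Example GPU sizes from the NCsv3 series that Dev may not create.
DENIED_IN_DEV = {"Standard_NC6s_v3", "Standard_NC12s_v3", "Standard_NC24s_v3"}


def evaluate(request: dict) -> list:
    """Return human-readable violations; an empty list means the request is allowed."""
    violations = []
    tags = request.get("tags", {})
    for key in sorted(REQUIRED_TAGS - set(tags)):
        violations.append(f"missing required tag: {key}")
    if tags.get("Environment") == "Dev" and request.get("size") in DENIED_IN_DEV:
        violations.append(f"size {request['size']} is not allowed in Dev")
    return violations


print(evaluate({"size": "Standard_NC6s_v3", "tags": {"Environment": "Dev"}}))
# ['missing required tag: Team', 'size Standard_NC6s_v3 is not allowed in Dev']
```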

10. Infrastructure as code (IaC)

By following the principles of IaC you can leverage code to apply tags in an automated fashion. By making Terraform modules that are reusable, other teams can get tags in place for “free”. Another benefit of using IaC is the ability to easily clean up resources you have created during a POC or test.
If Terraform is being used, you can also incorporate Infracost into your GitOps workflow, which gives you direct feedback in your pull request on how your code changes will impact the cost.

Bonus tip =)

Another fun scenario might be to destroy your test environments every night and have them rebuilt every morning before the developers start their work.
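
A rough estimate of what such a schedule saves, assuming a flat hourly price and that nothing is billed while the environment is destroyed (in practice storage, snapshots and similar may still cost something). The hourly cost and the uptime hours are made-up numbers.

```python
def scheduled_savings(hourly_cost: float, hours_up_per_day: float,
                      days_up_per_week: int = 7) -> dict:
    """Compare the weekly cost of an always-on environment with a scheduled one."""
    hours_per_week = 24 * 7
    up_hours = hours_up_per_day * days_up_per_week
    return {
        "always_on": hourly_cost * hours_per_week,
        "scheduled": hourly_cost * up_hours,
        "saved_fraction": 1 - up_hours / hours_per_week,
    }


# Environment up 12 hours a day, every day, at a hypothetical 2.0 per hour.
print(scheduled_savings(hourly_cost=2.0, hours_up_per_day=12))
# {'always_on': 336.0, 'scheduled': 168.0, 'saved_fraction': 0.5}
```

Passing `days_up_per_week=5` shows the additional effect of keeping the environment down over the weekend as well.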

Conclusion:

These are not all the possible ways to reduce your cloud costs, but they are a good starting point. The main thing is to start being cost-aware and to give developers the tools and insights to see what their services use and cost. Is it worth spending some extra time optimizing your code to run more efficiently, or should you just bump the CPU of the instance? Could the next sprint goal be to cut cloud costs by 10%? Maybe hire another developer with the money saved?

Author:
Jonathan Kyrklund
Platform Engineer