Organizations often expect to cut costs when migrating IT infrastructure from on-premises setups to the cloud. However, the reality can be starkly different. Various cost traps can emerge during cloud transformation, leading to unexpectedly high expenses if not carefully managed.
In this article, we will explore the most common cloud cost traps and provide strategies to avoid cloud expenditures during cloud transformation.
Estimating Cloud Transformation Expenses
Forecasting expenses during a cloud transformation can be challenging. In traditional IT setups, cost estimation is relatively straightforward, encompassing fixed costs such as data center rent, hardware, and licenses. Conversely, in a cloud environment, you pay based on usage, which can fluctuate greatly, complicating cost predictions.
This variability often leads to unforeseen cost spikes during cloud transformation.
At Gart Solutions, we help numerous organizations navigate their cloud transformation journeys and frequently encounter five key mistakes that can drive up costs.
Common Cloud Costs Optimization Traps Happening There:
1. The "Lift and Shift" Approach
A common mistake is migrating existing setups to the cloud without modifications, known as a "lift and shift." This method usually leads to high costs because it fails to leverage the cloud's unique advantages.
Instead, organizations should modernize their solutions through "refactoring." Refactoring involves changing or completely replacing applications or systems to exploit the cloud's benefits such as scalability, elasticity, self-service, and measurability. By optimizing resources for the cloud, organizations can achieve greater flexibility, efficiency, and cost savings.
2. Choosing the Wrong IT Architecture Setup
Selecting the right architecture for a cloud environment is critical. IT architecture must be carefully planned to maximize cloud benefits. Determining how services interact and which ones to use requires an architecture tailored to the organization’s needs. Poor architectural choices can result in an environment that does not meet security and scalability requirements, leading to costly adjustments. Hence, it's vital to seek expert advice, critically review recommendations, and explore alternatives that do not compromise functionality or performance.
Contact Gart & get the IT Infrastructure Consulting. Quick and long-term wins are guaranteed.
3. Overreliance on Enterprise Versions
Organizations often default to enterprise versions of cloud services without assessing their actual needs. This can result in unnecessarily high costs. Before opting for enterprise versions, it's crucial to evaluate if standard versions can meet the requirements.
Contact Gart & assess cloud services carefully.
4. Uncontrolled Capacity Planning
Accurately predicting capacity needs is a common challenge, where previous expertise is needed. Capacity requirements can vary — constant for some, linear for others, exponential for a third, and seasonal for a fourth. These variations can lead to over- or under-provisioning of resources, resulting in additional costs. Effective capacity planning involves basic estimates and continuous monitoring to detect changes in needs and adjust resources accordingly. Utilizing the built-in capabilities of cloud services for business alignment requires the right expertise.
Contact Gart to assess your resources and requirements and make optimal capacity planning.
5. Lack of Appropriate Skills
Cloud transformation increases IT delivery complexity and requires new skills. Without the right expertise, projects can become inefficient, leading to wrong decisions and increased costs.
Contact Gart - we are the trusted partner in cloud consulting, migration, and cost optimization.
6. Underestimating Data Transfer Costs
One often overlooked cost trap is data transfer. While moving data to the cloud might seem straightforward, the costs associated with data egress (data leaving the cloud) can be substantial. Organizations should be aware of these charges and plan data transfer strategies accordingly, such as optimizing data placement and minimizing data transfer needs.
7. Neglecting Long-term Cost Implications
Focusing solely on immediate cost savings can lead to higher long-term expenses. Organizations should consider the long-term implications of their cloud choices, including potential costs associated with scaling, maintenance, and upgrades. A comprehensive cost-benefit analysis that includes future scenarios can help in making more informed decisions.
8. Ignoring Security and Compliance Costs
Security and compliance are critical in cloud environments, and neglecting these aspects can lead to significant expenses. Ensuring that cloud deployments comply with regulatory standards and implementing robust security measures can prevent costly breaches and fines. Investing in security tools and expertise upfront is crucial for avoiding unexpected costs later.
Contact for consulting in security, architecture, and cloud technologies. Prevent unnecessary expenses and ensure a smooth transition.
9. Failing to Monitor and Optimize Usage
Continuous monitoring and optimization of cloud usage are essential for controlling costs. Without proper monitoring, organizations can easily overspend on unnecessary resources. Utilizing cloud management tools and implementing policies for regular audits and optimizations can help keep costs in check.
10. Overlooking Vendor Lock-in Risks
Vendor lock-in occurs when an organization becomes overly dependent on a single cloud provider, making it difficult and costly to switch providers. To mitigate this risk, organizations should adopt a multi-cloud strategy or ensure that their architecture allows for flexibility and portability across different cloud platforms.
11. Underutilizing Reserved Instances and Savings Plans
Cloud providers offer reserved instances and savings plans that can significantly reduce costs for predictable workloads. However, many organizations fail to take advantage of these options. By analyzing usage patterns and committing to long-term plans, organizations can achieve substantial savings compared to on-demand pricing.
12. Over-Provisioning Resources
Over-provisioning, or allocating more resources than needed, is a common cost trap. This often results from a lack of understanding of actual usage requirements. Implementing auto-scaling and right-sizing strategies can help in dynamically adjusting resources based on real-time demand, thus avoiding unnecessary expenses.
13. Lack of Governance and Policies
Effective governance and policies are crucial for managing cloud costs. Without proper governance, organizations can face uncontrolled spending and resource mismanagement. Establishing clear policies for resource usage, cost allocation, and accountability ensures that cloud resources are used efficiently, and costs are kept under control.
14. Neglecting to Decommission Unused Resources
Failing to decommission unused or idle resources can lead to wasted spending. Regularly reviewing and cleaning up unused instances, storage, and other resources can help in reducing unnecessary costs. Automation tools can assist in identifying and decommissioning these resources.
15. Ignoring Cloud Cost Management Tools
Most cloud providers offer cost management and optimization tools, but organizations often overlook these valuable resources. Utilizing these tools can provide insights into spending patterns, help identify cost-saving opportunities, and enable more effective budgeting and forecasting.
16. Misjudging Data Storage Costs
Data storage costs can add up quickly, especially if organizations do not manage their data efficiently. Implementing data lifecycle policies, such as archiving old data and deleting redundant data, can help control storage costs. Additionally, selecting the appropriate storage class based on access frequency and performance needs is essential.
17. Overlooking Software Licensing Costs
Cloud transformation can sometimes lead to increased software licensing costs, particularly if organizations do not review their licensing agreements carefully. It's important to understand the licensing implications of moving to the cloud and to negotiate contracts that align with cloud usage patterns.
18. Inadequate Backup and Disaster Recovery Planning
Backup and disaster recovery are critical components of cloud strategy, but inadequate planning in these areas can lead to unexpected costs. Organizations should ensure that their backup and recovery plans are cost-effective and align with their overall cloud strategy. This includes selecting the right tools and services for efficient data protection and recovery.
19. Failing to Leverage Cloud Provider Discounts
Cloud providers often offer discounts for committed usage, volume purchases, and other incentives. Failing to leverage these discounts can result in higher costs. Organizations should actively seek out and take advantage of available discounts to optimize their cloud spending.
Contact Gart and calculate cloud discounts for your case.
20. Not Considering Hidden Costs
Hidden costs, such as those related to network latency, data retrieval, and specialized support, can significantly impact the overall cloud budget. Organizations should thoroughly assess all potential costs, including those that may not be immediately apparent, to avoid budget surprises.
Final words
Cloud transformation requires specialized expertise & careful planning.
Contact us to address all organizational needs for a successful cloud transformation.
Gart Solutions can help your organization avoid common cost traps and achieve business objectives efficiently.
Review our latest Case Studies of our cloud migration & cost optimization projects.
Kubernetes is becoming the standard for development, while the entry barrier remains quite high. We've compiled a list of recommendations for application developers who are migrating their apps to the orchestrator. Knowing the listed points will help you avoid potential problems and not create limitations where Kubernetes has advantages. Kubernetes Migration.
Who is this text for?
It's for developers who don't have DevOps expertise in their team – no full-time specialists. They want to move to Kubernetes because the future belongs to microservices, and Kubernetes is the best solution for container orchestration and accelerating development through code delivery automation to environments. At the same time, they may have locally run something in Docker but haven't developed anything for Kubernetes yet.
Such a team can outsource Kubernetes support – hire a contractor company, an individual specialist or, for example, use the relevant services of an IT infrastructure provider. For instance, Gart Solutions offers DevOps as a Service – a service where the company's experienced DevOps specialists will take on any infrastructure project under full or partial supervision: they'll move your applications to Kubernetes, implement DevOps practices, and accelerate your time-to-market.
Regardless, there are nuances that developers should be aware of at the architecture planning and development stage – before the project falls into the hands of DevOps specialists (and they force you to modify the code to work in the cluster). We highlight these nuances in the text.
The text will be useful for those who are writing code from scratch and planning to run it in Kubernetes, as well as for those who already have a ready application that needs to be migrated to Kubernetes. In the latter case, you can go through the list and understand where you should check if your code meets the orchestrator's requirements.
The Basics
At the start, make sure you're familiar with the "golden" standard The Twelve-Factor App. This is a public guide describing the architectural principles of modern web applications. Most likely, you're already applying some of the points. Check how well your application adheres to these development standards.
The Twelve-Factor App
Codebase: One codebase, tracked in a version control system, serves all deployments.
Dependencies: Explicitly declare and isolate dependencies.
Configuration: Store configuration in the environment - don't embed in code.
Backing Services: Treat backing services as attached resources.
Build, Release, Run: Strictly separate the build, release, and run stages.
Processes: Run the application as one or more stateless processes.
Port Binding: Expose services through port binding.
Concurrency: Scale the application by adding processes.
Disposability: Maximize reliability with fast startup and graceful shutdown.
Dev/Prod Parity: Keep development, staging, and production environments as similar as possible.
Logs: Treat logs as event streams.
Admin Processes: Run administrative tasks as one-off processes.
Now let's move on to the highlighted recommendations and nuances.
Prefer Stateless Applications
Why?
Implementing fault tolerance for stateful applications will require significantly more effort and expertise.
Normal behavior for Kubernetes is to shut down and restart nodes. This happens during auto-healing when a node stops responding and is recreated, or during auto-scaling down (e.g., some nodes are no longer loaded, and the orchestrator excludes them to save resources).
Since nodes and pods in Kubernetes can be dynamically removed and recreated, the application should be ready for this. It should not write any data that needs to be preserved to the container it is running in.
What to do?
You need to organize the application so that data is written to databases, files to S3 storage, and, say, cache to Redis or Memcache. By having the application store data "on the side," we significantly facilitate cluster scaling under load when additional nodes need to be added, and replication.
In a stateful application where data is stored in an attached Volume (roughly speaking, in a folder next to the application), everything becomes more complicated. When scaling a stateful application, you'll have to order "volumes," ensure they're properly attached and created in the right zone. And what do you do with this "volume" when a replica needs to be removed?
Yes, some business applications should be run as stateful. However, in this case, they need to be made more manageable in Kubernetes. You need an Operator, specific agents inside performing the necessary actions... A prime example here is the postgres-operator. All of this is far more labor-intensive than simply throwing the code into a container, specifying five replicas, and watching it all work without additional dancing.
Ensure Availability of Endpoints for Application Health Checks
Why?
We've already noted that Kubernetes itself ensures the application is maintained in the desired state. This includes checking the application's operation, restarting faulty pods, disabling load, restarting on less loaded nodes, terminating pods exceeding the set resource consumption limits.
For the cluster to correctly monitor the application's state, you should ensure the availability of endpoints for health checks, the so-called liveness and readiness probes. These are important Kubernetes mechanisms that essentially poke the container with a stick – check the application's viability (whether it's working properly).
What to do?
The liveness probes mechanism helps determine when it's time to restart the container so the application doesn't get stuck and continues to work.
Readiness probes are not so radical: they allow Kubernetes to understand when the container is ready to accept traffic. In case of an unsuccessful check, the pod will simply be excluded from load balancing – no new requests will reach it, but no forced termination will occur.
You can use this capability to allow the application to "digest" the incoming stream of requests without pod failure. Once a few readiness checks pass successfully, the replica pod will return to load balancing and start receiving requests again. From the nginx-ingress side, this looks like excluding the replica's address from the upstream.
Such checks are a useful Kubernetes feature, but if liveness probes are configured incorrectly, they can harm the application's operation. For example, if you try to deploy an application update that fails the liveness/readiness checks, it will roll back in the pipeline or lead to performance degradation (in case of correct pod configuration). There are also known cases of cascading pod restarts, when Kubernetes "collapses" one pod after another due to liveness probe failures.
You can disable checks as a feature, and you should do so if you don't fully understand their specifics and implications. Otherwise, it's important to specify the necessary endpoints and inform DevOps about them.
If you have an HTTP endpoint that can be an exhaustive indicator, you can configure both liveness and readiness probes to work with it. By using the same endpoint, make sure your pod will be restarted if this endpoint cannot return a correct response.
Aim for More Predictable and Uniform Application Resource Consumption
Why?
Almost always, containers inside Kubernetes pods are resource-limited within certain (sometimes quite small) values. Scaling in the cluster happens horizontally by increasing the number of replicas, not the size of a single replica. In Kubernetes, we deal with memory limits and CPU limits. Mistakes in CPU limits can lead to throttling—exhausting the available CPU time allotted to the container. And if we promise more memory than is available, as the load increases and hits the ceiling, Kubernetes will start evicting the lowest priority pods from the node. Of course, limits are configurable. You can always find a value at which Kubernetes won't "kill" pods due to memory constraints, but nevertheless, more predictable resource consumption by the application is a best practice. The more uniform the application's consumption, the tighter you can schedule the load.
What to do
Evaluate your application: estimate approximately how many requests it processes, how much memory it occupies. How many pods need to be run for the load to be evenly distributed among them? A situation where one pod consistently consumes more than the others is unfavorable for the user. Such a pod will be constantly restarted by Kubernetes, jeopardizing the application's fault-tolerant operation. Expanding resource limits for potential peak loads is also not an option. In that case, resources will remain idle whenever there is no load, wasting money. In parallel with setting limits, it's important to monitor pod metrics. This could be kube-Prometheus-Stack, VictoriaMetrics, or at least Metrics Server (more suitable for very basic scaling; in its console, you can view stats from kubelet—how much pods are consuming). Monitoring will help identify problem areas in production and reconsider the resource distribution logic.
There is a rather specific nuance regarding CPU time that Kubernetes application developers should keep in mind to avoid deployment issues and having to rewrite code to meet SRE requirements. Let's say a container is allocated 500 milli-CPUs—roughly 0.5 CPU time of a single core over 100 ms of real time. Simplified, if the application utilizes CPU time in several continuous threads (let's say four) and "eats up" all 500 milli-CPUs in 25 ms, it will be frozen by the system for the remaining 75 ms until the next quota period. Staging databases run in Kubernetes with small limits exemplify this behavior: under load, queries that would normally take 5 ms plummet to 100 ms. If the response graphs show load continuously increasing and then latency spiking, you've likely encountered this nuance. Address resource management—give more resources to replicas or increase the number of replicas to reduce load on each one.
Leverage Kubernetes ConfigMaps, Secrets, and Environment Variables
Kubernetes has several objects that can greatly simplify life for developers. Study them to save yourself time in the future.
ConfigMaps are Kubernetes objects used to store non-confidential data as key-value pairs. Pods can use them as environment variables or as configuration files in volumes. For example, you're developing an application that can run locally (for direct development) and, say, in the cloud. You create an environment variable for your app—e.g., DATABASE_HOST—which the app will use to connect to the database. You set this variable locally to localhost. But when running the app in the cloud, you need to specify a different value—say, the hostname of an external database.
Environment variables allow using the same Docker image for both local use and cloud deployment. No need to rebuild the image for each individual parameter. Since the parameter is dynamic and can change, you can specify it via an environment variable.
The same applies to application config files. These files store certain settings for the app to function correctly. Usually, when building a Docker image, you specify a default config or a config file to load into Docker. Different environments (dev, prod, test, etc.) require different settings that sometimes need to be changed—for testing, for example.
Instead of building separate Docker images for each environment and config, you can mount config files into Docker when starting the pod. The app will then pick up the specified configuration files, and you'll use one Docker image for the pods.
Typically, iftheconfig files are large,youuse a Volume to mount them into Docker as files. If the config files contain short values, environment variables are more convenient. It all depends on yourapplication's requirements.
Another useful Kubernetes abstraction is Secrets. Secrets are similar to ConfigMaps but are meant for storing confidential data—passwords, tokens, keys, etc. Using Secrets means you don't need to include secret data in your application code. They can be created independently of the pods that use them, reducing the risk of data exposure. Secrets can be used as files in volumes mounted in one or multiple pod containers. They can also serve as environment variables for containers.
Disclaimer: In this point, we're only describing out-of-the-box Kubernetes functionality. We're not covering more specialized solutions for working with secrets, such as Vault.
Knowing about these Kubernetes features, a developer won't need to rebuild the entire container if something changes in the prepared configuration—for example, a password change.
Ensure Graceful Container Shutdown with SIGTERM
Why?
There are situations where Kubernetes "kills" an application before it has a chance to release resources. This is not an ideal scenario. It's better if the application can respond to incoming requests without accepting new ones, complete a transaction, or save data to a database before shutting down.
What to Do
A successful practice here is for the application to handle the SIGTERM signal. When a container is being terminated, the PID 1 process first receives the SIGTERM signal, and then the application is given some time for graceful termination (the default is 30 seconds). Subsequently, if the container has not terminated itself, it receives SIGKILL—the forced termination signal. The application should not continue accepting connections after receiving SIGTERM.
Many frameworks (e.g., Django) can handle this out-of-the-box. Your application may already handle SIGTERM appropriately. Ensure that this is the case.
Here are a few more important points:
The Application Should Not Depend on Which Pod a Request Goes To
When moving an application to Kubernetes, we expectedly encounter auto-scaling—you may have migrated for this very reason. Depending on the load, the orchestrator adds or removes application replicas. It's crucial that the application does not depend on which pod a client request goes to, for example, static pods. Either the state needs to be synchronized to provide an identical response from any pod, or your backend should be able to work across multiple replicas without corrupting data.
Your Application Should Be Behind a Reverse Proxy and Serve Links Over HTTPS
Kubernetes has the Ingress entity, which essentially provides a reverse proxy for the application—typically an nginx with cluster automation. For the application, it's enough to work over HTTP and understand that the external link will be over HTTPS. Remember that in Kubernetes, the application is behind a reverse proxy, not directly exposed to the internet, and links should be served over HTTPS. When the application returns an HTTP link, Ingress rewrites it to HTTPS, which can lead to looping and a redirect error. Usually, you can avoid such a conflict by simply toggling a parameter in the library you're using—checking that the application is behind a reverse proxy. But if you're writing an application from scratch, it's important to remember how Ingress works as a reverse proxy.
Leave SSL Certificate Management to Kubernetes
Developers don't need to think about how to "hook up" certificate addition in Kubernetes. Reinventing the wheel here is unnecessary and even harmful. For this, a separate service is typically used in the orchestrator—cert-manager, which can be additionally installed. The Ingress Controller in Kubernetes allows using SSL certificates for TLS traffic termination. You can use both Let's Encrypt and pre-issued certificates. If needed, you can create a special secret for storing issued SSL certificates.
In my experience optimizing cloud costs, especially on AWS, I often find that many quick wins are in the "easy to implement - good savings potential" quadrant.
[lwptoc]
That's why I've decided to share some straightforward methods for optimizing expenses on AWS that will help you save over 80% of your budget.
Choose reserved instances
Potential Savings: Up to 72%
Choosing reserved instances involves committing to a subscription, even partially, and offers a discount for long-term rentals of one to three years. While planning for a year is often deemed long-term for many companies, especially in Ukraine, reserving resources for 1-3 years carries risks but comes with the reward of a maximum discount of up to 72%.
You can check all the current pricing details on the official website - Amazon EC2 Reserved Instances
Purchase Saving Plans (Instead of On-Demand)
Potential Savings: Up to 72%
There are three types of saving plans: Compute Savings Plan, EC2 Instance Savings Plan, SageMaker Savings Plan.
AWS Compute Savings Plan is an Amazon Web Services option that allows users to receive discounts on computational resources in exchange for committing to using a specific volume of resources over a defined period (usually one or three years). This plan offers flexibility in utilizing various computing services, such as EC2, Fargate, and Lambda, at reduced prices.
AWS EC2 Instance Savings Plan is a program from Amazon Web Services that offers discounted rates exclusively for the use of EC2 instances. This plan is specifically tailored for the utilization of EC2 instances, providing discounts for a specific instance family, regardless of the region.
AWS SageMaker Savings Plan allows users to get discounts on SageMaker usage in exchange for committing to using a specific volume of computational resources over a defined period (usually one or three years).
The discount is available for one and three years with the option of full, partial upfront payment, or no upfront payment. EC2 can help save up to 72%, but it applies exclusively to EC2 instances.
Utilize Various Storage Classes for S3 (Including Intelligent Tier)
Potential Savings: 40% to 95%
AWS offers numerous options for storing data at different access levels. For instance, S3 Intelligent-Tiering automatically stores objects at three access levels: one tier optimized for frequent access, 40% cheaper tier optimized for infrequent access, and 68% cheaper tier optimized for rarely accessed data (e.g., archives).
S3 Intelligent-Tiering has the same price per 1 GB as S3 Standard — $0.023 USD.
However, the key advantage of Intelligent Tiering is its ability to automatically move objects that haven't been accessed for a specific period to lower access tiers.
Every 30, 90, and 180 days, Intelligent Tiering automatically shifts an object to the next access tier, potentially saving companies from 40% to 95%. This means that for certain objects (e.g., archives), it may be appropriate to pay only $0.0125 USD per 1 GB or $0.004 per 1 GB compared to the standard price of $0.023 USD.
Information regarding the pricing of Amazon S3
AWS Compute Optimizer
Potential Savings: quite significant
The AWS Compute Optimizer dashboard is a tool that lets users assess and prioritize optimization opportunities for their AWS resources.
The dashboard provides detailed information about potential cost savings and performance improvements, as the recommendations are based on an analysis of resource specifications and usage metrics.
The dashboard covers various types of resources, such as EC2 instances, Auto Scaling groups, Lambda functions, Amazon ECS services on Fargate, and Amazon EBS volumes.
For example, AWS Compute Optimizer reproduces information about underutilized or overutilized resources allocated for ECS Fargate services or Lambda functions. Regularly keeping an eye on this dashboard can help you make informed decisions to optimize costs and enhance performance.
Use Fargate in EKS for underutilized EC2 nodes
If your EKS nodes aren't fully used most of the time, it makes sense to consider using Fargate profiles. With AWS Fargate, you pay for a specific amount of memory/CPU resources needed for your POD, rather than paying for an entire EC2 virtual machine.
For example, let's say you have an application deployed in a Kubernetes cluster managed by Amazon EKS (Elastic Kubernetes Service). The application experiences variable traffic, with peak loads during specific hours of the day or week (like a marketplace or an online store), and you want to optimize infrastructure costs. To address this, you need to create a Fargate Profile that defines which PODs should run on Fargate. Configure Kubernetes Horizontal Pod Autoscaler (HPA) to automatically scale the number of POD replicas based on their resource usage (such as CPU or memory usage).
Manage Workload Across Different Regions
Potential Savings: significant in most cases
When handling workload across multiple regions, it's crucial to consider various aspects such as cost allocation tags, budgets, notifications, and data remediation.
Cost Allocation Tags: Classify and track expenses based on different labels like program, environment, team, or project.
AWS Budgets: Define spending thresholds and receive notifications when expenses exceed set limits. Create budgets specifically for your workload or allocate budgets to specific services or cost allocation tags.
Notifications: Set up alerts when expenses approach or surpass predefined thresholds. Timely notifications help take actions to optimize costs and prevent overspending.
Remediation: Implement mechanisms to rectify expenses based on your workload requirements. This may involve automated actions or manual interventions to address cost-related issues.
Regional Variances: Consider regional differences in pricing and data transfer costs when designing workload architectures.
Reserved Instances and Savings Plans: Utilize reserved instances or savings plans to achieve cost savings.
AWS Cost Explorer: Use this tool for visualizing and analyzing your expenses. Cost Explorer provides insights into your usage and spending trends, enabling you to identify areas of high costs and potential opportunities for cost savings.
Transition to Graviton (ARM)
Potential Savings: Up to 30%
Graviton utilizes Amazon's server-grade ARM processors developed in-house. The new processors and instances prove beneficial for various applications, including high-performance computing, batch processing, electronic design automation (EDA) automation, multimedia encoding, scientific modeling, distributed analytics, and machine learning inference on processor-based systems.
The processor family is based on ARM architecture, likely functioning as a system on a chip (SoC). This translates to lower power consumption costs while still offering satisfactory performance for the majority of clients. Key advantages of AWS Graviton include cost reduction, low latency, improved scalability, enhanced availability, and security.
Spot Instances Instead of On-Demand
Potential Savings: Up to 30%
Utilizing spot instances is essentially a resource exchange. When Amazon has surplus resources lying idle, you can set the maximum price you're willing to pay for them. The catch is that if there are no available resources, your requested capacity won't be granted.
However, there's a risk that if demand suddenly surges and the spot price exceeds your set maximum price, your spot instance will be terminated.
Spot instances operate like an auction, so the price is not fixed. We specify the maximum we're willing to pay, and AWS determines who gets the computational power. If we are willing to pay $0.1 per hour and the market price is $0.05, we will pay exactly $0.05.
Use Interface Endpoints or Gateway Endpoints to save on traffic costs (S3, SQS, DynamoDB, etc.)
Potential Savings: Depends on the workload
Interface Endpoints operate based on AWS PrivateLink, allowing access to AWS services through a private network connection without going through the internet. By using Interface Endpoints, you can save on data transfer costs associated with traffic.
Utilizing Interface Endpoints or Gateway Endpoints can indeed help save on traffic costs when accessing services like Amazon S3, Amazon SQS, and Amazon DynamoDB from your Amazon Virtual Private Cloud (VPC).
Key points:
Amazon S3: With an Interface Endpoint for S3, you can privately access S3 buckets without incurring data transfer costs between your VPC and S3.
Amazon SQS: Interface Endpoints for SQS enable secure interaction with SQS queues within your VPC, avoiding data transfer costs for communication with SQS.
Amazon DynamoDB: Using an Interface Endpoint for DynamoDB, you can access DynamoDB tables in your VPC without incurring data transfer costs.
Additionally, Interface Endpoints allow private access to AWS services using private IP addresses within your VPC, eliminating the need for internet gateway traffic. This helps eliminate data transfer costs for accessing services like S3, SQS, and DynamoDB from your VPC.
Optimize Image Sizes for Faster Loading
Potential Savings: Depends on the workload
Optimizing image sizes can help you save in various ways.
Reduce ECR Costs: By storing smaller instances, you can cut down expenses on Amazon Elastic Container Registry (ECR).
Minimize EBS Volumes on EKS Nodes: Keeping smaller volumes on Amazon Elastic Kubernetes Service (EKS) nodes helps in cost reduction.
Accelerate Container Launch Times: Faster container launch times ultimately lead to quicker task execution.
Optimization Methods:
Use the Right Image: Employ the most efficient image for your task; for instance, Alpine may be sufficient in certain scenarios.
Remove Unnecessary Data: Trim excess data and packages from the image.
Multi-Stage Image Builds: Utilize multi-stage image builds by employing multiple FROM instructions.
Use .dockerignore: Prevent the addition of unnecessary files by employing a .dockerignore file.
Reduce Instruction Count: Minimize the number of instructions, as each instruction adds extra weight to the hash. Group instructions using the && operator.
Layer Consolidation: Move frequently changing layers to the end of the Dockerfile.
These optimization methods can contribute to faster image loading, reduced storage costs, and improved overall performance in containerized environments.
Use Load Balancers to Save on IP Address Costs
Potential Savings: depends on the workload
Starting from February 2024, Amazon begins billing for each public IPv4 address. Employing a load balancer can help save on IP address costs by using a shared IP address, multiplexing traffic between ports, load balancing algorithms, and handling SSL/TLS.
By consolidating multiple services and instances under a single IP address, you can achieve cost savings while effectively managing incoming traffic.
Optimize Database Services for Higher Performance (MySQL, PostgreSQL, etc.)
Potential Savings: depends on the workload
AWS provides default settings for databases that are suitable for average workloads. If a significant portion of your monthly bill is related to AWS RDS, it's worth paying attention to parameter settings related to databases.
Some of the most effective settings may include:
Use Database-Optimized Instances: For example, instances in the R5 or X1 class are optimized for working with databases.
Choose Storage Type: General Purpose SSD (gp2) is typically cheaper than Provisioned IOPS SSD (io1/io2).
AWS RDS Auto Scaling: Automatically increase or decrease storage size based on demand.
If you can optimize the database workload, it may allow you to use smaller instance sizes without compromising performance.
Regularly Update Instances for Better Performance and Lower Costs
Potential Savings: Minor
As Amazon deploys new servers in their data processing centers to provide resources for running more instances for customers, these new servers come with the latest equipment, typically better than previous generations. Usually, the latest two to three generations are available. Make sure you update regularly to effectively utilize these resources.
Take Memory Optimize instances, for example, and compare the price change based on the relevance of one instance over another. Regular updates can ensure that you are using resources efficiently.
InstanceGenerationDescriptionOn-Demand Price (USD/hour)m6g.large6thInstances based on ARM processors offer improved performance and energy efficiency.$0.077m5.large5thGeneral-purpose instances with a balanced combination of CPU and memory, designed to support high-speed network access.$0.096m4.large4thA good balance between CPU, memory, and network resources.$0.1m3.large3rdOne of the previous generations, less efficient than m5 and m4.Not avilable
Use RDS Proxy to reduce the load on RDS
Potential for savings: Low
RDS Proxy is used to relieve the load on servers and RDS databases by reusing existing connections instead of creating new ones. Additionally, RDS Proxy improves failover during the switch of a standby read replica node to the master.
Imagine you have a web application that uses Amazon RDS to manage the database. This application experiences variable traffic intensity, and during peak periods, such as advertising campaigns or special events, it undergoes high database load due to a large number of simultaneous requests.
During peak loads, the RDS database may encounter performance and availability issues due to the high number of concurrent connections and queries. This can lead to delays in responses or even service unavailability.
RDS Proxy manages connection pools to the database, significantly reducing the number of direct connections to the database itself.
By efficiently managing connections, RDS Proxy provides higher availability and stability, especially during peak periods.
Using RDS Proxy reduces the load on RDS, and consequently, the costs are reduced too.
Define the storage policy in CloudWatch
Potential for savings: depends on the workload, could be significant.
The storage policy in Amazon CloudWatch determines how long data should be retained in CloudWatch Logs before it is automatically deleted.
Setting the right storage policy is crucial for efficient data management and cost optimization. While the "Never" option is available, it is generally not recommended for most use cases due to potential costs and data management issues.
Typically, best practice involves defining a specific retention period based on your organization's requirements, compliance policies, and needs.
Avoid using an undefined data retention period unless there is a specific reason. By doing this, you are already saving on costs.
Configure AWS Config to monitor only the events you need
Potential for savings: depends on the workload
AWS Config allows you to track and record changes to AWS resources, helping you maintain compliance, security, and governance. AWS Config provides compliance reports based on rules you define. You can access these reports on the AWS Config dashboard to see the status of tracked resources.
You can set up Amazon SNS notifications to receive alerts when AWS Config detects non-compliance with your defined rules. This can help you take immediate action to address the issue. By configuring AWS Config with specific rules and resources you need to monitor, you can efficiently manage your AWS environment, maintain compliance requirements, and avoid paying for rules you don't need.
Use lifecycle policies for S3 and ECR
Potential for savings: depends on the workload
S3 allows you to configure automatic deletion of individual objects or groups of objects based on specified conditions and schedules. You can set up lifecycle policies for objects in each specific bucket. By creating data migration policies using S3 Lifecycle, you can define the lifecycle of your object and reduce storage costs.
These object migration policies can be identified by storage periods. You can specify a policy for the entire S3 bucket or for specific prefixes. The cost of data migration during the lifecycle is determined by the cost of transfers. By configuring a lifecycle policy for ECR, you can avoid unnecessary expenses on storing Docker images that you no longer need.
Switch to using GP3 storage type for EBS
Potential for savings: 20%
By default, AWS creates gp2 EBS volumes, but it's almost always preferable to choose gp3 — the latest generation of EBS volumes, which provides more IOPS by default and is cheaper.
For example, in the US-east-1 region, the price for a gp2 volume is $0.10 per gigabyte-month of provisioned storage, while for gp3, it's $0.08/GB per month. If you have 5 TB of EBS volume on your account, you can save $100 per month by simply switching from gp2 to gp3.
Switch the format of public IP addresses from IPv4 to IPv6
Potential for savings: depending on the workload
Starting from February 1, 2024, AWS will begin charging for each public IPv4 address at a rate of $0.005 per IP address per hour. For example, taking 100 public IP addresses on EC2 x $0.005 per public IP address per month x 730 hours = $365.00 per month.
While this figure might not seem huge (without tying it to the company's capabilities), it can add up to significant network costs. Thus, the optimal time to transition to IPv6 was a couple of years ago or now.
Here are some resources about this recent update that will guide you on how to use IPv6 with widely-used services — AWS Public IPv4 Address Charge.
Collaborate with AWS professionals and partners for expertise and discounts
Potential for savings: ~5% of the contract amount through discounts.
AWS Partner Network (APN) Discounts: Companies that are members of the AWS Partner Network (APN) can access special discounts, which they can pass on to their clients. Partners reaching a certain level in the APN program often have access to better pricing offers.
Custom Pricing Agreements: Some AWS partners may have the opportunity to negotiate special pricing agreements with AWS, enabling them to offer unique discounts to their clients. This can be particularly relevant for companies involved in consulting or system integration.
Reseller Discounts: As resellers of AWS services, partners can purchase services at wholesale prices and sell them to clients with a markup, still offering a discount from standard AWS prices. They may also provide bundled offerings that include AWS services and their own additional services.
Credit Programs: AWS frequently offers credit programs or vouchers that partners can pass on to their clients. These could be promo codes or discounts for a specific period.
Seek assistance from AWS professionals and partners. Often, this is more cost-effective than purchasing and configuring everything independently. Given the intricacies of cloud space optimization, expertise in this matter can save you tens or hundreds of thousands of dollars.
More valuable tips for optimizing costs and improving efficiency in AWS environments:
Scheduled TurnOff/TurnOn for NonProd environments: If the Development team is in the same timezone, significant savings can be achieved by, for example, scaling the AutoScaling group of instances/clusters/RDS to zero during the night and weekends when services are not actively used.
Move static content to an S3 Bucket & CloudFront: To prevent service charges for static content, consider utilizing Amazon S3 for storing static files and CloudFront for content delivery.
Use API Gateway/Lambda/Lambda Edge where possible: In such setups, you only pay for the actual usage of the service. This is especially noticeable in NonProd environments where resources are often underutilized.
If your CI/CD agents are on EC2, migrate to CodeBuild: AWS CodeBuild can be a more cost-effective and scalable solution for your continuous integration and delivery needs.
CloudWatch covers the needs of 99% of projects for Monitoring and Logging: Avoid using third-party solutions if AWS CloudWatch meets your requirements. It provides comprehensive monitoring and logging capabilities for most projects.
Feel free to reach out to me or other specialists for an audit, a comprehensive optimization package, or just advice.