alphalist Blog

Spot instances decoded: Understanding, implementing, and saving

written by

Matan Bordo

Product Marketing Manager @ DoIT

Spot instances (or Spot VMs on Google Cloud) can reduce your compute costs by anywhere from 20-80%, and designing for them tends to push your applications toward greater resilience. Yet they aren’t used nearly as much as they could be. The reasons include their less-than-predictable nature, a fear of workload interruptions, and the complexity of setting up and managing them, but the most common reason many teams avoid Spot instances is simply a lack of familiarity with them. Understanding how to navigate these perceived hurdles will help you realize previously untapped compute savings.

Table of Contents
  • What are Spot instances?

  • Why you should use Spot instances (hint: it’s not just because of cost savings)

  • EC2 Instance Pools Explained

  • Auto Scaling groups (ASGs) explained

  • ASGs and Spot instances

  • When you should use Spot instances

  • Testing Environments and CI/CD

  • Batch processing tasks

  • High-performance computing (HPC) and big data processing

  • Web servers

  • Containerized workloads / Kubernetes

  • Conclusion

What are Spot instances?

Spot instances are sort of like cheap, same-day flights that become available due to last-minute cancellations or unsold inventory. Airlines often reduce prices significantly to fill these empty seats quickly, before the flight takes off.

In the case of compute instances, cloud providers sell their unused computing capacity at much lower prices than on-demand instances (up to 90% off) as a way to make use of that excess capacity.

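On AWS, you no longer need to bid for Spot instances. You can optionally set a maximum price (the most you’re willing to pay per hour), which defaults to the on-demand price if you leave it unset. As long as spare capacity is available and the Spot price (the current price in the Spot market) is at or below your maximum, the instance runs.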

However, Spot instances can be reclaimed by the cloud provider on short notice (two minutes on AWS) if demand for regular-priced, on-demand instances increases, potentially interrupting your application.

Spot instances and last-minute flight tickets both offer a chance to get something (computing power for the former, a seat for the latter) at a reduced cost. However, there's a level of uncertainty and risk involved: Spot instances might be reclaimed if the provider needs the capacity back or the Spot price rises above your maximum price, and last-minute flight tickets might vanish if someone else purchases them before you.

Interruptions most commonly occur when:

  1. There's a surge in demand for on-demand or reserved instances

  2. The Spot price rises above your maximum price (less likely now that Spot prices change gradually)
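On AWS, the interruption warning is exposed through the instance metadata service as a small JSON document (the spot/instance-action item). The sketch below shows one way to read it, assuming the documented shape with an "action" field and a UTC "time" field. The helper names are hypothetical, and fetching the document from the metadata endpoint is deliberately left out.

```python
import json
from datetime import datetime, timezone


def parse_interruption_notice(body):
    """Return (action, interruption_time) from an instance-action document."""
    notice = json.loads(body)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    return notice["action"], when.replace(tzinfo=timezone.utc)


def seconds_remaining(interruption_time, now):
    """Seconds left before the instance is reclaimed (never negative)."""
    return max(0.0, (interruption_time - now).total_seconds())


# Sample document with made-up timestamps, for illustration only.
sample = '{"action": "terminate", "time": "2024-01-01T12:02:00Z"}'
action, when = parse_interruption_notice(sample)
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(action, seconds_remaining(when, now))
```

A process that polls for this document every few seconds can use the remaining time to decide whether to finish its current task or hand it off.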

Why you should use Spot instances (hint: it’s not just because of cost savings)

While the potential EC2 savings (up to 90%!) are often touted as the major benefit of using Spot instances, they’re not the only benefit.

Using Spot instances doesn't inherently make applications more resilient. Rather, applications need a certain level of resilience already in order to accommodate the interruptions that come with Spot instances.

For example, applications running on Spot instances should be architected to handle interruptions gracefully, without causing significant disruption to the overall system. That means designing for checkpoints, auto-saving mechanisms, or workloads distributed across multiple instances.

That way, your infrastructure:

  1. Better handles fluctuations, 

  2. Maintains performance during peak times, and 

  3. Mitigates risks associated with potential interruptions or failures.

During peak loads, Spot instances can be integrated into your system to handle increased demand, ensuring that your system can accommodate fluctuations in traffic or workload without performance degradation.

With cheaper instances, you can allocate more resources toward redundancy and failover mechanisms, and distribute workloads across more instances. 

And if one instance experiences an interruption, other instances can continue processing parts of the workload, minimizing the impact of any single failure. This ensures that your workloads seamlessly shift to another instance without significant cost implications.

EC2 Instance Pools Explained

To best leverage AWS Spot instances, it's important to conceptually understand EC2 instance pools. An EC2 instance pool refers to the total capacity of an instance type (e.g. m5.xlarge) in a given availability zone.

When there’s unused capacity in an instance pool, that spare capacity is referred to as a Spot capacity pool.

Each combination of instance type (family and size) and availability zone, in each region, is a distinct EC2 instance pool, and therefore a distinct Spot capacity pool.

As such, you shouldn’t “put all your eggs in one basket.” The more pools you tap into, the more diversified your potential instance selection will be — which minimizes the chances that Spot instances aren’t available for your application to use.
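A rough way to see why diversification helps: suppose each pool has the same chance of running out of Spot capacity in a given window, and that pools behave independently. Independence is a simplifying assumption, since real pools can be correlated, and the 10% figure below is purely illustrative rather than a measured rate. Under those assumptions, the chance of every pool being unavailable at once shrinks quickly as pools are added.

```python
def chance_all_pools_unavailable(per_pool_chance, pool_count):
    """Probability that every pool is unavailable at once, assuming independence."""
    if not 0.0 <= per_pool_chance <= 1.0:
        raise ValueError("per_pool_chance must be between 0 and 1")
    if pool_count < 1:
        raise ValueError("pool_count must be at least 1")
    return per_pool_chance ** pool_count


# 0.10 is an illustrative per-pool chance, not real AWS data.
for pools in (1, 2, 4, 8):
    chance = chance_all_pools_unavailable(0.10, pools)
    print(f"{pools} pool(s): {chance:.6%}")
```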

Building on this, we’re going to explore how to optimize Spot instance utilization with Auto Scaling groups (ASGs), and the nuances of Spot allocation strategies, providing insight into which strategy suits different scenarios best and why.

Auto Scaling groups (ASGs) explained

ASGs are a mechanism for managing groups of instances. They make your workloads more elastic by automatically adjusting the number of instances deployed based on demand, and help you enhance fault tolerance.

Scaling out when demand increases ensures performance and responsiveness, while scaling in during periods of lower demand helps reduce unnecessary costs.

You can configure them for allowed instance types and availability zones, minimum and maximum limits on the number of instances, and a mixed instances policy (percent Spot instances vs. percent on-demand instances), among others. Once configured, ASGs automate much of the resource management.
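To make those settings concrete, here is a minimal sketch that builds the keyword arguments for an Auto Scaling group with a mixed instances policy, in the shape boto3's create_auto_scaling_group call accepts. The group name, launch template name, subnet IDs, and the helper function itself are hypothetical. The code only builds a dictionary, so it makes no AWS calls.

```python
def build_asg_request(name, launch_template, instance_types, subnet_ids,
                      min_size, max_size, on_demand_percent):
    """Build keyword arguments for a mixed on-demand/Spot Auto Scaling group."""
    return {
        "AutoScalingGroupName": name,
        "MinSize": min_size,
        "MaxSize": max_size,
        # Comma-separated subnets; one per availability zone spreads the group.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "MixedInstancesPolicy": {
            "LaunchTemplate": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": launch_template,
                    "Version": "$Latest",
                },
                # Each override is one allowed instance type.
                "Overrides": [{"InstanceType": t} for t in instance_types],
            },
            "InstancesDistribution": {
                # Share of on-demand instances; the rest are Spot.
                "OnDemandPercentageAboveBaseCapacity": on_demand_percent,
                "SpotAllocationStrategy": "price-capacity-optimized",
            },
        },
    }


# All names and IDs below are hypothetical placeholders.
request = build_asg_request(
    name="web-asg",
    launch_template="web-template",
    instance_types=["m5.xlarge", "m5a.xlarge", "m4.xlarge"],
    subnet_ids=["subnet-aaa", "subnet-bbb"],
    min_size=2,
    max_size=10,
    on_demand_percent=25,
)
print(request["MixedInstancesPolicy"]["InstancesDistribution"])
```

With boto3 installed and credentials configured, the result could be passed as boto3.client("autoscaling").create_auto_scaling_group(**request). Listing several similar instance types across more than one subnet is what lets the group draw from multiple Spot capacity pools.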

For example, if you ran an ecommerce website, it would be much easier to respond appropriately to changes in web traffic with Auto Scaling groups.

Manually dealing with changes in traffic would require constant monitoring and quick responses to avoid crashes or slow performance, potentially impacting user experience and resulting in revenue loss.

ASGs and Spot instances

ASGs are especially important in the context of Spot instances because they help you handle Spot interruptions and optimize their utilization by automatically replacing interrupted Spot instances with new ones. 

While ASGs don't directly distribute workloads among instances, they ensure the desired number of instances is available. The distribution of incoming workloads or web requests across these instances is typically managed by services like Elastic Load Balancers. ASGs work with these load balancers to automatically direct traffic to each new instance they create, helping balance the workload distribution efficiently.

Additionally, they provide the flexibility of using different instance types within the same group. This helps you utilize various Spot instance types based on availability and cost.

When you should use Spot instances

In general, Spot instances are best suited for workloads that:

  • Are flexible,

  • Don’t have specific time requirements,

  • Are distributable / can be split into concurrently-running tasks, and 

  • Can tolerate interruptions

We’ll cover the specific use cases where using Spot instances makes sense, but first, here are three questions to help you figure out whether your workloads are suitable for Spot instances:

  1. Are my workloads fault-tolerant?

    Since Spot instances can be interrupted, workloads must be designed to handle interruptions without causing a critical failure or data loss.

    Fault-tolerant workloads can continue running or can quickly recover when instances are interrupted or terminated.

  2. Can the workload be stopped in < 2 minutes?

    Workloads must be stoppable within a short notice period to prevent data loss or disruption.

    If your workload can be stopped in less than two minutes, it becomes easier to respond to Spot instance interruptions.

    For this reason, stateless applications are well-suited for Spot instances, since they don’t store session data. This makes it easy for them to seamlessly migrate between instances without losing functionality or data, making them resilient to interruptions.

  3. Can I be flexible about instance types and availability zones?

    Distributing your workloads across multiple instance types and availability zones reduces their vulnerability to interruptions by spreading the risk. Remember, capacity is a property of a Spot instance pool, and each instance type in each availability zone is a separate pool. When you’re able to tap into more than one pool, the risk of all of those pools being interrupted at the same time is lower than the risk of an interruption in a single pool.

    Spreading across multiple availability zones decreases dependency on a single pool, ensuring continuity even if one zone experiences capacity constraints or price spikes.
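The second question is the easiest to check in code. Below is a minimal sketch of a worker that stops taking new items once a stop is requested and hands back whatever it hasn't started, so another instance can pick it up. The Worker class and the way the stop request arrives are hypothetical and not part of any AWS SDK. In practice the stop would be triggered by an interruption notice.

```python
import threading


class Worker:
    """Processes items one at a time and stops cleanly when asked to."""

    def __init__(self):
        self.stop_requested = threading.Event()
        self.processed = []

    def run(self, items, handle):
        """Handle items in order; return the ones left unstarted after a stop."""
        for index, item in enumerate(items):
            if self.stop_requested.is_set():
                return list(items[index:])
            handle(item)
            self.processed.append(item)
        return []


worker = Worker()


def handle(item):
    # Simulate an interruption notice arriving while item 2 is in flight.
    if item == 2:
        worker.stop_requested.set()


leftover = worker.run([1, 2, 3, 4], handle)
print(worker.processed, leftover)
```

The in-flight item is allowed to finish, which only works if a single item reliably completes well inside the notice period. Longer-running items would need to be split up or checkpointed.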

More specifically, you should consider using Spot instances in the following situations.

Testing Environments and CI/CD

Testing/Dev environments and CI/CD tasks usually don’t need continuous uptime because they’re used intermittently to work on specific features or test changes. Additionally, development and testing tasks can be restarted, or paused and resumed (if planned ahead), without critical data loss, making them more tolerant of interruptions.

These workloads are often flexible in terms of resource requirements and can adapt to different instance types or availability zones without compromising the work being performed.

Batch processing tasks

Batch processing and ETL jobs oftentimes aren’t time-critical, allowing for flexibility that makes Spot instances a great fit.

These tasks can also be broken down into smaller, independent units that can be distributed across multiple instances without significant impact if an instance is interrupted. 

This way, the interruption of one instance doesn't hinder the completion of the entire job, as the workload can be redistributed among other available instances. And if no instances are available, jobs can be structured to save intermediate state and resume from the last checkpoint once capacity returns.
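As an illustration of the checkpointing idea, here is a minimal sketch of a batch job that records its position after every record and resumes from there on the next run. The function name, the JSON checkpoint format, and the stop_after parameter (which only simulates an interruption) are all assumptions made for this example.

```python
import json
import os
import tempfile


def run_batch(records, process, checkpoint_path, stop_after=None):
    """Process records in order, saving the next index after each one."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]
    results = []
    for index in range(start, len(records)):
        if stop_after is not None and len(results) >= stop_after:
            break  # simulated interruption
        results.append(process(records[index]))
        with open(checkpoint_path, "w") as f:
            json.dump({"next_index": index + 1}, f)
    return results


checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
records = [1, 2, 3, 4, 5]
first_run = run_batch(records, lambda x: x * 10, checkpoint, stop_after=2)
second_run = run_batch(records, lambda x: x * 10, checkpoint)
print(first_run, second_run)
```

The second run picks up at the third record instead of starting over. On a real Spot instance the checkpoint would need to live somewhere that outlives the instance, such as object storage or a database, rather than on a local temporary directory.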

High-performance computing (HPC) and big data processing

High-performance computing tasks involve handling and analyzing vast amounts of data. Spot instances make sense for these types of workloads because these tasks can be distributed across various instances and allow for easy scaling up and down. 

Typically these tasks are costly since processing large datasets requires substantial compute resources, but with Spot instances the cost of each instance is much lower — and with thousands of instances this adds up.

Web servers

Web servers are great candidates for Spot instances because they are usually stateless. They don’t typically store data locally or rely on information from previous sessions, and therefore they can be interrupted without significant impact.

In many cases with web servers, each request is processed independently without relying on stored session information.

Containerized workloads / Kubernetes

Containerized applications are often designed to be stateless, making them a good candidate for Spot instances.

Since containers don't usually store session-specific data, new containers can be spun up or shut down without affecting the overall system. 

Also, since containers divide applications into smaller, independent units, containerized workloads can adapt easily to different instance types or availability zones. This flexibility aligns perfectly with the variable nature of Spot instances.

Conclusion

We've covered everything you need to know about Spot instances — from their concept to leveraging their advantages effectively, and which use cases allow you to maximize their advantages. 

If you’re deploying EC2 instances with Auto Scaling groups, learn more about Spot Scaling here.

Matan Bordo

Matan is a Product Marketing Manager at DoiT focusing on product enablement and adoption, FinOps practices, and go-to-market strategies. Originally from California, he now lives and works in Tel Aviv.