12-04-2021, 05:13 PM
You might be considering storing checkpoints on slower disks, and it’s a thought that many folks in the IT world entertain. There’s a bit of nuance here, and I think it’s worth unpacking.
When you’re working with checkpoints—especially in environments like Hyper-V or VMware—you rely on these snapshots to capture the state of your systems at specific points in time. The idea is that if something goes wrong, you can revert to a previous state. Checkpoints can be a lifesaver, especially in scenarios such as testing new applications or rolling back after a patch fails.
Now, using slower disks for storing checkpoints can be a double-edged sword. On one hand, it may seem convenient. Slower disks, such as traditional spinning hard drives, are often less expensive and can provide ample storage space. On the other hand, performance issues can arise, and these can impact your overall workflow.
Let’s talk about performance first. If you store checkpoints on a slower disk, you might experience increased latency during operations that touch those checkpoints. When a VM has a checkpoint, it doesn’t just read from the parent VHD or VHDX file; it also has to work through the differencing files (AVHDX in Hyper-V) that are created when a checkpoint is taken. When those files sit on fast disks, read and write speeds stay healthy and the operations running inside the VM proceed smoothly.
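If you want to see where that extra I/O actually lands, a quick script can list a VM’s disk chain and show which volume each file sits on. This is a minimal sketch in Python, assuming the virtual disks live under a hypothetical D:\VMs\Web01 folder; the folder name and layout are placeholders for illustration, not anything the hypervisor mandates.

# Sketch: list the parent VHD/VHDX and AVHDX differencing files for one VM,
# with sizes, so you can see which volume serves each part of the chain.
# The folder path below is a made-up example location.
from pathlib import Path

vm_folder = Path(r"D:\VMs\Web01")  # hypothetical location of the VM's disks

for disk in sorted(vm_folder.rglob("*")):
    if disk.suffix.lower() in (".vhd", ".vhdx", ".avhd", ".avhdx"):
        size_gb = disk.stat().st_size / 1024**3
        kind = "checkpoint (differencing)" if "avhd" in disk.suffix.lower() else "parent"
        print(f"{disk.drive}  {kind:25}  {size_gb:7.2f} GB  {disk.name}")

If every AVHDX in that listing lands on a slow spindle while the parent sits on an SSD, you already know which part of the chain is going to hold the VM back.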
If the VM is pulling data from a slow disk, however, you may notice slower response times. For example, if you’re running a web application on a VM whose checkpoint lives on a slow disk, users may experience lag when they try to access the site, and that can degrade the user experience significantly.
Let's take a real-life scenario that I encountered. I was managing a Hyper-V environment where a colleague had set up checkpoints on slower, older drives. Everything seemed fine at first, but as VM usage increased, users began reporting performance issues. Monitoring made it clear that transactions were delayed and I/O wait times had climbed; we were simply bottlenecked by the slow storage. Once I migrated those checkpoints to solid-state drives, everything turned around, with a noticeable improvement in application performance.
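If you suspect the same kind of bottleneck, you don’t need anything fancy to confirm it. Here’s a rough sketch using Python’s psutil package that samples the per-disk I/O counters twice and prints the average time per read and write over the interval; the interval length is an arbitrary illustrative choice, and mapping a physical disk number back to a volume is something you’d confirm in Disk Management.

# Rough I/O latency probe: sample psutil's per-disk counters twice and
# report average milliseconds per read/write during the interval.
import time
import psutil

INTERVAL = 10  # seconds; arbitrary sampling window

before = psutil.disk_io_counters(perdisk=True)
time.sleep(INTERVAL)
after = psutil.disk_io_counters(perdisk=True)

for disk, b in before.items():
    a = after[disk]
    reads = a.read_count - b.read_count
    writes = a.write_count - b.write_count
    # read_time/write_time are cumulative milliseconds spent on I/O
    avg_read_ms = (a.read_time - b.read_time) / reads if reads else 0.0
    avg_write_ms = (a.write_time - b.write_time) / writes if writes else 0.0
    print(f"{disk}: {reads} reads @ {avg_read_ms:.1f} ms, "
          f"{writes} writes @ {avg_write_ms:.1f} ms")

A disk that averages tens of milliseconds per read while its neighbour sits at one or two is usually the one holding your checkpoint files.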
It’s also crucial to consider backup strategies. In any setup, checkpoints shouldn’t be confused with backups. While checkpoints can help in immediate recovery scenarios, they don’t replace a robust backup solution. BackupChain, as an example, is used to manage backups without relying solely on checkpoints; it integrates well with Hyper-V and allows regular backups to run even while checkpoints are in play. Using a dedicated backup solution ensures that your data is preserved on a more reliable cadence than checkpoints alone can achieve.
Again, if you’ve decided to use those slower disks for checkpoints, make sure you have an efficient backup strategy designed around that. If the slow disks fail, you risk losing critical checkpoints along with the data contained in those snapshots. I’ve seen environments where the lack of a solid backup meant admins had to rebuild entire systems from scratch, which is a nightmare to manage.
Another aspect to consider is the total available resources of the host machine. I run into this frequently in environments where memory and CPU are overcommitted. If the host doesn’t have enough horsepower, the impact of slower disk storage gets compounded: the I/O load from the slower disks can saturate the system’s resources, making it hard for the VM to function optimally. Applications may crash, and troubleshooting becomes cumbersome.
You may also want to think about how often you create checkpoints. Frequent checkpointing exacerbates the problems of slower disk speeds, so fewer, more strategic checkpoints are generally the better approach. If you’re constantly checkpointing a VM whose checkpoint storage sits on those slow disks, the degraded performance can create a backlog of pending I/O, which adds even more latency.
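If checkpoint sprawl is part of the problem, a small retention rule helps. The sketch below only works out which checkpoints would be kept or pruned from a list of names and creation times; the sample data is made up, and the actual removal and merge of the differencing disks still happens through your hypervisor’s own tooling (Get-VMCheckpoint and Remove-VMCheckpoint in Hyper-V’s PowerShell module, for instance).

# Sketch of a simple retention rule: keep only the N newest checkpoints.
# The checkpoint list is made-up sample data; in practice you'd feed it
# from your hypervisor and perform the removal there as well.
from datetime import datetime

KEEP = 2  # how many recent checkpoints to retain per VM

checkpoints = [
    ("pre-patch-march", datetime(2021, 3, 14, 22, 0)),
    ("pre-patch-april", datetime(2021, 4, 11, 22, 0)),
    ("before-app-upgrade", datetime(2021, 11, 30, 18, 30)),
]

checkpoints.sort(key=lambda c: c[1], reverse=True)
for name, created in checkpoints[:KEEP]:
    print(f"keep   {name} ({created:%Y-%m-%d})")
for name, created in checkpoints[KEEP:]:
    print(f"prune  {name} ({created:%Y-%m-%d})")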
On the flip side, you could also partition your storage approach. Think about keeping your active workloads on faster SSDs while relegating checkpoints that aren’t in the critical path to slower disks. You’d be amazed at how effective this kind of tiered storage can be at avoiding bottlenecks.
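One way to keep that tiering honest is to make the placement decision explicit rather than ad hoc. This is just a sketch of the decision logic, with hypothetical tier names, VM names, and paths; Hyper-V lets you set a checkpoint file location per VM, and this only illustrates how you might choose it.

# Sketch: pick a checkpoint storage path based on how critical the VM is.
# Tier names, VM names, and paths are hypothetical; apply the result with
# your hypervisor's per-VM checkpoint-location setting.
TIER_PATHS = {
    "critical": r"S:\Checkpoints",   # fast SSD volume
    "standard": r"E:\Checkpoints",   # slower HDD volume
}

VM_TIERS = {
    "SQL01": "critical",
    "Web01": "critical",
    "TestLab01": "standard",
}

def checkpoint_path(vm_name: str) -> str:
    """Return the volume where this VM's checkpoints should live."""
    tier = VM_TIERS.get(vm_name, "standard")  # default to cheap storage
    return TIER_PATHS[tier]

for vm in VM_TIERS:
    print(f"{vm}: checkpoints -> {checkpoint_path(vm)}")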
A key consideration is your workload characteristics. If you are running applications that do not have heavy I/O demands, you might find that the slower disks can handle checkpoint operations without significantly impacting performance. However, if you’re running databases or any application requiring high I/O throughput, it’s likely that slower disks will become a pain point.
In environments using technologies such as Hyper-V, each VM can be sized and configured according to its needs. If you have VMs that don’t require the speed of SSDs but still do important work, checkpointing might not be as adversely affected. Still, I’d argue it pays to err on the side of caution: test the waters with a VM on a slower disk, but always have a plan to move to faster storage if issues emerge.
Finally, let’s not dismiss the potential of hybrid setups. These can be engineered to combine the best of both worlds. If budget permits, using a mixed setup where some checkpoints are kept on faster disks and others on slower disks can offer flexibility in managing your resources.
If your organization maintains strict policies on performance and uptime, you might want to think twice before embracing slow disks for your checkpoints. By keeping an eye on performance benchmarks, system logs, and user feedback, you can tell whether the impact is showing up in tangible, undesirable ways. Technology isn’t static, and how you manage your storage will likely evolve alongside the capabilities of your infrastructure. Keeping all these aspects in mind will help you make a more informed decision.
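Monitoring doesn’t have to be elaborate either. A sketch like the one below, with an arbitrary latency threshold and a hypothetical log file name, appends periodic samples to a CSV and flags when a disk’s average read latency crosses the line, giving you a record to compare against user complaints.

# Sketch: log average read latency per disk to a CSV and flag breaches.
# Threshold, interval, and log path are arbitrary illustrative choices.
import csv
import time
from datetime import datetime
import psutil

THRESHOLD_MS = 20.0            # flag disks averaging worse than this per read
INTERVAL = 60                  # seconds between samples
LOG_FILE = "disk_latency.csv"  # hypothetical log location

prev = psutil.disk_io_counters(perdisk=True)
while True:
    time.sleep(INTERVAL)
    cur = psutil.disk_io_counters(perdisk=True)
    with open(LOG_FILE, "a", newline="") as f:
        writer = csv.writer(f)
        for disk, c in cur.items():
            p = prev.get(disk)
            if p is None:
                continue
            reads = c.read_count - p.read_count
            avg_ms = (c.read_time - p.read_time) / reads if reads else 0.0
            writer.writerow([datetime.now().isoformat(), disk, round(avg_ms, 2)])
            if avg_ms > THRESHOLD_MS:
                print(f"WARNING: {disk} averaged {avg_ms:.1f} ms per read")
    prev = cur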
My advice is that while using slower disks for checkpoints can be done, it is crucial to carefully weigh the performance implications and the potential risks involved. There’s a delicate balance to maintain between cost and operational efficiency. You’d be better off investing in a tiered storage model or maintaining enough fast storage to ensure that those critical checkpoint operations run smoothly.