Back in 2022, we were all-in on the cloud. Like many growing companies, we invested heavily in building out our infrastructure in Azure. We moved workloads, spun up services, and architected for performance, flexibility, and scale. But what came next was something many companies experience (but few talk about): Cost Shock.

Like many companies after the Covid pandemic, we saw adopting the public cloud as the prudent thing to do. Little did we know, this would be a very sharp double-edged sword – one that cut deep into our financial and engineering resources, but that also produced an unexpected outcome.
We settled on Microsoft Azure as our cloud infrastructure provider. GCP was too idiosyncratically “Google” for our needs, and AWS simply too complex. As we moved our workloads into Azure, we thought we were doing everything right. We used Reserved Instances where possible to gain savings, we were deliberate about which services we spun up, and we even limited the number of engineers who had access to spin up VMs.
Our Azure costs rose, significantly and fast! But we rationalized those costs as the cost of doing business, telling ourselves we would recoup them over time.
We also had the pricing scares that so many customers go through. We spun up a firewall service at a premium tier that accidentally doubled our monthly bill, and it took a plea to Microsoft to reverse the charges (which they did).
After 6 months of higher highs in our Azure costs, we finally said – enough is enough. This doesn’t make sense.
🔧 Four Strategies to Control Azure Costs
Ultimately, we used the following four strategies to get control of our Azure costs:
- Find unnecessary spend and idle resources
- Consolidate and optimize resources
- Put Hybrid Cloud to Work!
- Create monitoring automation to close the pricing feedback loop
Here’s exactly how we evaluated each of these strategies.
🧩 Find unnecessary spend and idle resources
We started with the easy stuff:
- Find all idle virtual machines and evaluate whether they could be blown away
- Remove all objects associated with idle VMs (public IPs, disks, snapshots, etc.); see the sketch after this list
- Find idle services and evaluate whether they still needed to exist
- Collapse storage costs into centralized blob storage
- Tune retention period of backups so we weren’t holding on to unnecessary data
- Shift long-running VMs to Reserved Instances (RIs)
- Re-evaluate our most expensive services (for us, that was AKS – see below)
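To make the first two checks repeatable, we scripted them. What follows is a minimal sketch of that kind of sweep, not our production script, assuming the Az PowerShell module and a signed-in session (Connect-AzAccount); it only reports candidates, it doesn’t delete anything.

```powershell
# 1. Deallocated (idle) VMs: candidates for removal.
Get-AzVM -Status |
    Where-Object { $_.PowerState -eq 'VM deallocated' } |
    Select-Object Name, ResourceGroupName, PowerState |
    Format-Table

# 2. Unattached managed disks, often left behind after a VM is deleted.
Get-AzDisk |
    Where-Object { $_.DiskState -eq 'Unattached' } |
    Select-Object Name, ResourceGroupName, DiskSizeGB |
    Format-Table

# 3. Public IPs not bound to any NIC or load balancer.
Get-AzPublicIpAddress |
    Where-Object { $null -eq $_.IpConfiguration } |
    Select-Object Name, ResourceGroupName, IpAddress |
    Format-Table
```

We’d run something like this per subscription (Set-AzContext) and eyeball the output before deleting anything by hand.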
Our mantra in this exercise was: a dollar is a dollar.
What we meant was that we often fell into the trap of thinking “that’s just a few bucks a month, it’s nothing” or “eh, $100 a month, that’s no big deal” and sure, in the grand scheme of things those low-dollar items didn’t really move the needle on their own. But in aggregate, they did.
So we put every dollar on the table: whether it was a $1k / month Kubernetes cluster or a $3.70 / month public IP, we evaluated everything. We developed automated PowerShell scripts to collect spend data to help with this evaluation (a sketch is below).
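As a flavor of that spend collection, here’s a minimal sketch, assuming the Az.Billing module and a subscription type where the consumption API is available. It pulls the last 30 days of usage and ranks every resource by cost, so the $3.70 line items land right next to the $1k ones.

```powershell
# Pull the last 30 days of usage detail for the current subscription.
$usage = Get-AzConsumptionUsageDetail -StartDate (Get-Date).AddDays(-30) -EndDate (Get-Date)

# Aggregate pretax cost per resource and sort descending,
# so nothing hides in the noise.
$usage |
    Group-Object InstanceName |
    ForEach-Object {
        [pscustomobject]@{
            Resource = $_.Name
            Cost     = [math]::Round(($_.Group | Measure-Object PretaxCost -Sum).Sum, 2)
        }
    } |
    Sort-Object Cost -Descending |
    Format-Table -AutoSize
```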
💡 Consolidate and Optimize Resources
We evaluated the kinds of workloads we had. Did we have services running on three nodes that could run on two? In some cases we had applications running on two nodes that could be consolidated to one. Take this example:
- We operated a security application on two pay-as-you-go VMs, each costing us about $350 / month all in: VM, disks, snapshots, etc. $700 / month = $8,400 / year
- We consolidated onto one VM and used a 1-year RI on it, dropping the cost to $390 / month = $4,680 / year.
- Savings: $8,400 - $4,680 = $3,720 a year, or about 44%.
We found that if we could do that a few times, the savings were material.
In other instances, we were paying to self-host services that could be farmed out to a managed cloud service much more cost-effectively. Our SIEM (Security Information and Event Management) was an example of this.
🚀 Put Hybrid Cloud to Work
We found that running hypervisors in our private cloud data center was always cheaper than Azure, particularly for complex services like Azure Kubernetes Service (AKS). The total cost of ownership for running VMs on our own gear was simply lower.
So we optimized our private cloud. Here’s what we did.
- We already had data center space because we operate an HSM-as-a-service business where we rack security appliances and operate them on behalf of customers.
- We spent about $30k building out infrastructure to host internal applications. Against the roughly $5k a month in Azure spend it displaced, we would break even after about 6 months. That’s a no-brainer.
- We run a lot of Kubernetes, and this was the perfect use for the private cloud infrastructure. Dwain migrated all of our Kubernetes workloads out of Azure and into the private cloud infrastructure.
- We created a seamless hybrid network between Azure and our private cloud, so for us, it’s just one fabric – use what we want in Azure, use cost-effective private cloud resources where appropriate (a sketch of one way to wire this up follows).
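For the curious, here is one way such a hybrid link can be wired up with the Az cmdlets: a site-to-site IPsec VPN sketch (ExpressRoute is the other common option). Every name, address, and key below is made up; this is not our actual topology.

```powershell
# Sketch: site-to-site VPN joining the private cloud data center to an Azure VNet.
# All names, IPs, and the shared key are placeholders.
$rg  = 'hybrid-rg'
$loc = 'eastus'

# The on-prem side: its public VPN endpoint and the private cloud address space.
$dcSide = New-AzLocalNetworkGateway -Name 'dc-gateway' -ResourceGroupName $rg `
    -Location $loc -GatewayIpAddress '203.0.113.10' -AddressPrefix '10.50.0.0/16'

# An existing VPN gateway on the Azure VNet.
$azureGw = Get-AzVirtualNetworkGateway -Name 'azure-vpn-gw' -ResourceGroupName $rg

# The IPsec tunnel that makes the two environments one routable fabric.
New-AzVirtualNetworkGatewayConnection -Name 'dc-to-azure' -ResourceGroupName $rg `
    -Location $loc -VirtualNetworkGateway1 $azureGw -LocalNetworkGateway2 $dcSide `
    -ConnectionType IPsec -SharedKey 'replace-with-a-real-secret'
```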
Net-net: we save about 30% on workloads we run in the private cloud environment.
🎯 Create Monitoring Automation
Before, we would just “use” Azure, get sticker shock when the invoice came, and try to fix things retroactively.
It was difficult for our engineers, because with the cloud you don’t always know the ramifications of your actions or how much stuff actually costs. Our engineers are experts at building stuff, not making cost-based business decisions.
Alex created a variety of PowerShell scripts that would monitor the state of objects in select subscriptions and essentially “project” the estimated cost for new objects. Every night the scripts would run, see what was created, and derive a future cost estimate.
We agreed with engineering that if there was added spend over a certain threshold, we had a quick conversation – mostly on Slack – about why it was added.
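A simplified sketch of that kind of nightly check is below. This isn’t Alex’s actual script: it diffs today’s resource inventory against yesterday’s snapshot, and the price table is a hypothetical, hand-maintained estimate keyed by resource type (real pricing depends on SKU and region).

```powershell
# Hypothetical, hand-maintained monthly cost estimates by resource type.
$estimates = @{
    'Microsoft.Compute/virtualMachines'          = 150
    'Microsoft.Network/publicIPAddresses'        = 4
    'Microsoft.ContainerService/managedClusters' = 300
}

$snapshotFile = 'C:\finops\resource-ids.txt'
$today = Get-AzResource | Select-Object -ExpandProperty ResourceId

if (Test-Path $snapshotFile) {
    # Anything only in today's list ('=>') was created since the last run.
    $new = Compare-Object (Get-Content $snapshotFile) $today |
        Where-Object SideIndicator -eq '=>' |
        Select-Object -ExpandProperty InputObject

    foreach ($id in $new) {
        $res  = Get-AzResource -ResourceId $id
        $cost = $estimates[$res.ResourceType]
        if (-not $cost) { $cost = '?' }
        Write-Output "NEW: $($res.Name) ($($res.ResourceType)) ~ `$$cost/month"
    }
}

# Save today's inventory for tomorrow's comparison.
$today | Set-Content $snapshotFile
```

From there, totaling the estimates and posting anything over the agreed threshold to a Slack webhook (Invoke-RestMethod does the job) closes the feedback loop.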
Some other monitoring-related things we did:
- Developed a habit of applying “time” tags to resources as they were created. The tags were “day, week, month, perm, unknown”. That way, we could run scripts looking for any resource that had outlived its tag; if something tagged to live for about a month was still running past a month, we knew to check on that asset: did we still need it? (A sketch of this check follows the list.)
- Look for wasteful spend. Scripts would run to find orphaned objects (e.g. leftover public IPs, unattached disks) and queue them for review after a certain period of time.
- Quarterly asset review: Using automation to create a service dashboard, we would review what was running in Azure, whether we could save costs by migrating it to the private cloud, and what the return on investment was for keeping it in Azure.
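Here’s a minimal sketch of that tag check. One assumption beyond the scheme described above: it presumes a companion “created” date tag is stamped on each resource, since the “time” tag alone doesn’t say when the clock started.

```powershell
# Flag resources whose 'time' tag says short-lived but which have outlived it.
# Assumes a companion 'created' tag in yyyy-MM-dd format on each resource.
$maxAgeDays = @{ day = 1; week = 7; month = 31 }

foreach ($res in Get-AzResource -TagName 'time') {
    $lifetime = $res.Tags['time']
    if ($lifetime -in 'perm', 'unknown') { continue }        # nothing to enforce
    if (-not $res.Tags.ContainsKey('created')) { continue }  # can't date it

    $created = [datetime]::ParseExact($res.Tags['created'], 'yyyy-MM-dd', $null)
    $ageDays = ((Get-Date) - $created).Days

    if ($ageDays -gt $maxAgeDays[$lifetime]) {
        Write-Output "STALE: $($res.Name) tagged '$lifetime', running $ageDays days"
    }
}
```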
🛡️ In Conclusion
Turns out, by adopting some of the principles of “FinOps” in our cloud management routine, we were able to make peace with our cloud costs. We understand them, and they are deliberate. Additionally, we have a better framework for making cloud utilization decisions.
Several months ago, we were talking with a customer of ours about what we did in this area, and they asked us to help them in the same way. We were able to lower their Azure costs by 30% within a few months.
Since then, we’ve run this framework and automation for other customers, and while the savings have been striking, we made a more important discovery: everyone is overpaying for the cloud.
We were, our customers were, and likely, you are too.
👉 Maybe together we can get control of your cloud costs. If you’d like to speak to a member of our team about an engagement, contact us at [email protected] or email me directly: [email protected].