E03: 3 Strategies to Maximize Cloud Cost Savings

Nick and Jason discuss 3 actionable strategies for managing your cloud spend. First, companies must clean up their environment and remove irrelevant hardware and software. They also go into detail on getting the biggest discounts from cloud providers and why refactoring apps is a wise ongoing process for realizing cost savings.

Show Notes

Hey, what's up everyone? This is Nick with tenacitycloud.com, with my co-host Jason, and you're listening to the Cloud Cost Optimization podcast. More people than ever are building cool stuff in the cloud and spending a whole lot of money to do it. Today we're gonna talk about some of the strategies for managing costs in the cloud and managing your cloud spend.

So, just to kick it off: broadly, I think there are three approaches that we agree on, from a cloud operations perspective, as to how you get your arms around spending and how you reduce that spending, or maximize the potential savings that you have.

The first and most obvious is cleaning up the environment, right? I mean, this is in my experience, Jason's experience, and we'll talk about this in just a moment. You know, it's almost a hundred percent universally true that there are things to clean up. It's also almost a hundred percent universally...

Right. It's almost a hundred percent universally true that there's also somebody on the team, or the entire team, going, "No, [00:01:00] no, no, no, no. Everything is absolutely necessary." And that is also 100% true. And so, you know, it might not be costing you a lot of money, but it's either costing you a lot of money or it's gonna cost you a security headache. It's one of the two. It's typically not a benign thing.

Right. Well, it's a snowball effect too, you know: yeah, I have a few of these things hanging out, but over time the amount of waste in the environment starts to build up.

The amount of misconfiguration starts to build up. Well, it's the same problem you and I have been dealing with for years, man. It's the same problem. Yep. Virtualization, which we've talked about on the podcast. You know, like, remember when you were like, "Wait a minute, why do I have like 30 extra virtual machines on this cluster? What are these things here?"

It is the same problem at a different scale that IT has been plagued with for many years. It was physical servers before: you'd get a collection of those, and next thing you know, you've got hundreds of [00:02:00] physical servers. Why do I need all these? And for half of 'em, you know, the applications aren't even really being used anymore, or if they're being used, they're not being used very much.

Absolutely. And then it went to virtual servers, and now it's just the same problem in a different area. And, you know, now you've got more people affecting what's going into the pot. Whereas before it was a team of people, you know, in IT, now every developer has access to the cloud environment to spin up and spin down stuff. Everybody in the network infrastructure team has the ability to do it. Same problem, different stuff. We'll dive into that. Dealing with it for years. And guess what? In 10 years we're gonna be talking about the same shit.

Sorry. We're gonna dive into that in just a second. But the second strategy, which we'll get into, is really around commitments and managing your discount programs, 'cause those are available from each of the big three. And, you know, there are some subtle differences in how to leverage them across different players.

We'll go much deeper on that in future episodes, but we definitely wanna touch on the strategy there. And then the final strategy is really around refactoring those legacy [00:03:00] apps and those things in the environment that could take advantage of newer and cheaper technologies, and probably work better in the process.

So, you know, let's head right back to that first one, where we were diving in. I mean, man, we could probably share some horror stories, current ones. But one that I like to talk about, especially with new folks any time I have 'em on my team running operations, is this: one of the first organizations I worked in was a massive hospital, and I was assigned to upgrade a bunch of their network stack.

And so I had to go out to these data closets and install a bunch of new equipment. But in those racks was, you know, probably about three feet of old equipment that was still stacked in there.

And when I inquired about, you know, "Are we removing that?", the answer was like, "No, no, no, no, no. You gotta leave that alone. There's this one thing, this one app, this one tool, this one thing down in the OR that runs off of that. So we have to leave it in there." And there's [00:04:00] one single wire running from this one stack to the next, to the next, to the next. And this new thing is going in. It's gonna take all the connections except for that, you know, one or two necessary things.

And, you know, I think that translates into what's happening in cloud today. It's not physically there. You can't walk in the closet and touch it and see it, where it's a reminder and you eventually do solve that problem. A lot of times in cloud, you know, it's software, so it's just kind of ignored. You don't even really see it.

You know, it's the same concept. It's just not in physical form, right? I mean, it is the same thing. We just had a customer where, I know, this is a very similar story, right? They had an AWS account that had one server in it, and nobody knew about this server because it was an account that was created like years ago, before they had their entire infrastructure moved to AWS. And that account was closed.

And then like two weeks... [00:05:00] And it's an application that gets used rarely, right? It's just one of those things that's been out there. Close the account down because, you know, nothing was in it, right? Well, "nothing" was in it. There was something in it, but you have one account with one server in it.

What region is it in? Where is it located? You don't log in and see "here's all your resources." You just don't see stuff like that, right? And so we deleted it. Or they deleted it. But fortunately enough, you can bring that stuff back. And it wasn't that big of an app. It wasn't that critical of an application, but it's the same exact problem.

It's just that you don't have a closet you can go look at. It's almost like trying to find a file in Google Drive or OneDrive without the ability to search. And you didn't put it there, so you don't even have the name of said file. You just have what is on it, right? And you don't know what it is.

And it's like trying to find that. [00:06:00] So it's a needle in the haystack. Whereas, you know, in our years prior, you could go into a closet and look at what you had and put a console on it and see what was running. I can't do that. You know, it's important, 'cause you could ask the question. You could actually walk to something and ask the question.

I'm not saying that was a better day. That was not a better day. We are in far better times now, with the ability to, you know, treat infrastructure as code and really include infrastructure in our software. That is a far better world, and I wish I had had that 20 years ago.

But, you know, the same problem kind of persists. It exists, and it's not just the kind of thing that's left out there and is still being used. You know, we've run into really egregious cases, or I have personally in my career, really egregious cases of things that aren't used and are just hanging out.

I remember a case where an old cluster had been decommissioned, but all the workloads had been left on it. And so that hardware was never actually decommissioned. It was still consuming power. It still needed cooling. It was still connected to all the systems and all the storage and everything that was necessary to keep it up and [00:07:00] running.

But it had absolutely no use, just because, you know, the people who had been assigned that decommissioning project got far enough along that it was, you know, at least decommissioned from a business perspective, but it wasn't actually decommissioned from an operational perspective.

And so I think you see that in the cloud world too, where, you know, there are times when we go in to consult and find thousands of abandoned resources that just frankly aren't being used anymore. I love this phrase. It's always the response: "Well, we kept that just in case." I think "just in case" is responsible for more bad cloud spending, and more waste in companies, than any other issue across the board.

Well, remember, we found one of the most egregious... We usually find a significant amount, but this is one that sticks out in my head: $200,000 of additional [00:08:00] spend per year in just EBS, yeah, volumes that were not attached to any instance and didn't have any relevant data. All it was, was a process the developers used when they were testing that wasn't configured properly: it wasn't set to delete the EBS volume on termination of these test instances.

It was $187,000 of annual spend that they had. It was almost 10,000 EBS volumes that were not connected to anything. And when our system reported that, they opened up a ticket and said our system is wrong. The infrastructure guy said, "Nope. This is wrong. You're reporting in error," without even going to look at it.

Right? Because then, you know, of course I'm calling the engineering team. I'm like, "Something's wrong. We gotta go verify all this stuff to make sure," because I assumed... And we went and verified it, and everything was accurate.
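For readers who want to run the same kind of check, here is a minimal Python sketch of the idea: filter a volume inventory down to volumes with no attachments and put a rough annual price on them. It is not the tooling from the story. The record shape mirrors the fields returned by the EC2 DescribeVolumes API (`VolumeId`, `Size` in GiB, `State`, `Attachments`), the sample data is made up, and the flat price per GB-month is a placeholder assumption, since real EBS pricing depends on volume type and region.

```python
from typing import Iterable

# ASSUMPTION: placeholder flat rate in dollars per GB-month.
# Real EBS pricing varies by volume type and region.
ASSUMED_PRICE_PER_GB_MONTH = 0.10


def find_unattached(volumes: Iterable[dict]) -> list:
    """Return volumes that are not attached to any instance.

    In EC2 terms these report State == "available" and an empty
    Attachments list.
    """
    return [
        v for v in volumes
        if v.get("State") == "available" and not v.get("Attachments")
    ]


def estimated_annual_cost(volumes: Iterable[dict],
                          price_per_gb_month: float = ASSUMED_PRICE_PER_GB_MONTH) -> float:
    """Rough yearly storage cost for the given volumes (Size is in GiB)."""
    total_gb = sum(v["Size"] for v in volumes)
    return total_gb * price_per_gb_month * 12


# Made-up inventory shaped like DescribeVolumes output.
sample_volumes = [
    {"VolumeId": "vol-aaa", "Size": 100, "State": "in-use",
     "Attachments": [{"InstanceId": "i-123"}]},
    {"VolumeId": "vol-bbb", "Size": 200, "State": "available", "Attachments": []},
    {"VolumeId": "vol-ccc", "Size": 50, "State": "available", "Attachments": []},
]

orphans = find_unattached(sample_volumes)
print([v["VolumeId"] for v in orphans])
print(round(estimated_annual_cost(orphans), 2))
```

A report like this only finds the leftovers. The root cause in the story was test instances launched without delete-on-termination set for their volumes, so the longer-term fix sits in the launch configuration rather than in the cleanup script.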

We had all the raw data. We sent the email back, [00:09:00] and lo and behold, five minutes later he said, "You're right, we have 9,800 EBS volumes that need to be deleted." That's not an uncommon story. That's probably to the nth degree, right, having 10,000 of one specific thing. But, you know, that is far easier to go suss out than 10,000 of 30 different AWS services, potentially.

Yeah. I'm taken back to, you know, one of my mentors in my career, who made the comment (mm-hmm), and when you're a software developer this is one of those things that rings true: just because you've been able to do it faster doesn't mean that it's successful.

So you might actually be automating... And, you know, that is one of the cases where code can bite you as well. It definitely happens. So why don't we talk about... Dude, and it happens to us. It happens to the best of 'em. But that's why we have tools in place to help us do that, because our engineers automate stuff that they shouldn't automate.

[00:10:00] Every company we've been in, it's like, "Why'd you automate that? Where was the manual process?" "We didn't have one." "What'd you automate, then?" You know, how do you know what to automate if there's no manual process for it first? So, you know, that's a common scenario, right? I mean, everybody has that problem. So let's touch on commitments, 'cause we only have a little bit of time left.

You know, really this is the second vehicle. Once the environment's sort of cleaned up, you get the configuration correct. And again, we'll dive into that in another episode: all the various ways that you can clean up the environment and what to look for. But then when you get to the ability to actually get discounts on the environment due to commitments, you know, there are a number of options here, and we often see people get stuck in analysis paralysis, or at least have a great deal of uncertainty around what they should do.

What are your thoughts, Jason? Well, it's just as confusing as the billing itself, right? Commitments are procured based on [00:11:00] utilization or spend, right? You're committing to something. So without tools, it's difficult to predict what those things are gonna be one or three years into the future, which are typically the options you have.

But the good thing about these commitments is you can save a lot of money on the most expensive parts of cloud infrastructure. And they make it really easy for you to save a certain amount, right? So it's like, you know, they make it super easy for you to save 10%, right? And that's because that's what they want you to save. Because they know if you really did the work, you could save north of 50% by a combination of commitments and optimization and all these other things.

So I would say it follows the same pattern as, you know, having a tough time understanding your costs. Cloud usage, billing, and the data around it are extremely complex. And understanding how your usage and spend [00:12:00] patterns correlate to the right savings commitments is a task that really a machine has to perform. And then, you know, again, the providers do a pretty good job of saying, "Here are all your options. There are a hundred of them. You choose the best one that's right for you."
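To make "correlating usage patterns to the right commitment" a bit more concrete, here is a toy Python sketch. It assumes a deliberately simplified model that is not any provider's actual billing rule: you commit to covering a fixed amount of on-demand-equivalent usage every hour at a fractional discount, you pay for that commitment whether or not you use it, and anything above it is billed at the on-demand rate. The function names, the usage numbers, and the 30% discount are all illustrative assumptions.

```python
def total_cost(hourly_usage, commit, discount):
    """Total cost over the period under a simplified commitment model.

    hourly_usage: on-demand-equivalent dollars of usage in each hour.
    commit:       on-demand-equivalent dollars per hour covered by the commitment.
    discount:     fractional discount on committed usage (assumed value, e.g. 0.30).
    """
    # The commitment is paid every hour, used or not.
    committed = commit * (1 - discount) * len(hourly_usage)
    # Usage above the commitment falls back to on-demand pricing.
    overflow = sum(max(0.0, u - commit) for u in hourly_usage)
    return committed + overflow


def best_commit(hourly_usage, discount, candidates):
    """Pick the candidate commitment level with the lowest total cost."""
    return min(candidates, key=lambda c: total_cost(hourly_usage, c, discount))


# Made-up usage: a steady baseline of 10 with a couple of bursts.
usage = [10, 10, 10, 10, 20, 20, 40, 10]
discount = 0.30  # ASSUMPTION: illustrative discount rate

on_demand_only = total_cost(usage, 0, discount)
choice = best_commit(usage, discount, range(0, 45, 5))
with_commit = total_cost(usage, choice, discount)

print("best hourly commitment:", choice)
print("savings vs on-demand: {:.1%}".format(1 - with_commit / on_demand_only))
```

Even in this toy model, committing above the steady baseline costs more than it saves, because the unused commitment in quiet hours outweighs the discount in busy ones. That is one reason the baseline question matters, and since usage keeps shifting, the answer has to be recomputed rather than set once.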

Right? So they've given you all these things. Now you go choose, and choosing is where the hard part comes in, because what combination of those is gonna provide me the biggest discount? I see the one that you're recommending. It's the one that saves me 10, 12% on my bill. How do I unlock those real savings, 40, 50%? And you can't do that without a deep understanding of the data behind your usage and spend.

Yeah. You know, I think a place where we've seen a lot of people, a lot of organizations, a lot of users get this wrong is: is this one and done, set it and forget it? Or is this an ongoing process that needs to continue to live and breathe in your management of your cloud spend?

[00:13:00] I would say, yeah. Well, I mean, we know the answer, right? It's not one and done. And again, this is one of those things that's confusing to me, because everything about your cloud environment, your spend, your utilization, that's all a living, breathing thing. Why would this savings piece of it be any different?

You don't set and forget your cloud infrastructure from a security perspective, do you? We just set it up, forget it, don't monitor it, don't do anything to it, right? That's what we do to provide the most secure infrastructure? No, it's not. We have to constantly manage it. We have to constantly monitor it.

We have to react to changes in usage patterns, in spend patterns, in additional resources and additional applications. I mean, it's a living, breathing thing. So if you are that lone company that has a static environment that never changes, maybe you don't need a tool to do this. But if you're anything like us and the millions of other [00:14:00] businesses out there that utilize cloud infrastructure, it is a living, breathing animal.

Therefore, all the management and monitoring, not only around security and visibility but around costs, as well as what savings instruments you're implementing, all need to be managed. I think it might be a holdover from, you know, the legacy buying. The buyers, you know, honestly haven't changed.

We haven't had that much time go by. "I get my 10% discount on my loan, and I buy once every three years." Right. So, you know, I think it might be a holdover, but you're right, it's ongoing. I think this is a good concept. But yeah, once you understand what they do, how they implement these savings commitments, you realize pretty quickly that this is ongoing management.

Yeah, absolutely. Especially, and we'll dive into this in another episode, when you start trying to, you know, wrap your head around, "Well, am I gonna commit to use or am I gonna commit to spend? And why should I commit to one or the other? What do I need to look at to understand and build a strategy?"

And [00:15:00] again, we could do episodes on this. What are my usage and spend patterns telling me? What's my baseline? It's just like, if you don't have something to help you manage these commitments... And again, they're very powerful, just like cloud infrastructure, right? They're very flexible, just like cloud infrastructure. They're extremely complex and confusing, just like cloud infrastructure. It's all the same.

So I think, you know, the final one we'll just touch on, but again, we can do whole episodes, and we'll probably bring on guests to talk about this, 'cause this is a very broad topic: then you get to the refactor component, right?

Of really thinking about, you know, where can my applications take greater advantage of public cloud? Where can I start to use services to reduce my spending, control my costs, and manage it better based on the usage patterns of the app, et cetera? So, you know, really we get into this topic of refactoring.

And of course there are many different areas we could dive into here, from refactoring legacy monolithic apps all the way down to even the [00:16:00] ongoing refactoring that happens. We see it over and over with our own engineering team: how you rethink even your own cloud-native app because you find a way to take advantage of cost differences in the environment and do things slightly differently in order to drive down costs.

We've done that ourselves. We've refactored our own application to cut our cloud costs in half, because we discovered ways that we could take advantage of infrastructure in different ways. It's just a reality of learning and understanding and things changing in real time.

I think the important part of that is that the only way we were able to do that was because of the way we designed our application and how it leverages serverless infrastructure in a lot of ways. Obviously we still have some dedicated compute, because it's hard to eliminate all of that, but it's a fraction of our total environment.

Yeah, I mean, listen, if that stuff was running on a physical server, we'd be spending the same amount today that we were right [00:17:00] before that. Right? It wouldn't have cost us any less. You know, but the problem, well, the difficulty, is that it requires time from developers that you probably don't have, because, we watch the news, there's a shortage out there, especially in the IT realm.

So it takes time. But one of the best ways, if not the best way, to optimize and save costs in the cloud is to refactor. It does mean that you need to make sure you really have a handle on how you're monitoring your environment from a spend perspective, because you're starting to turn your usage, your actual usage, into what affects your bill.

So, right, like running a function, a Lambda function or a function in Azure: if that spins up for, you know, 300 seconds or whatever it [00:18:00] is, and then spins down, I'm billed on that 300 seconds. So, you know, and we've seen this before, right? A developer pushes a change to a Lambda function, which runs 24 hours a day, and next thing you know, their bill's $10,000. And you could have hundreds of Lambda functions.

So you can save a ton of money by going that route, but you definitely need to make sure that you're monitoring, that you have visibility into what those things are doing. Absolutely. Well, I think we're at time for today, so that's it.
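To put rough numbers on the function example from a moment ago, here is a small Python sketch of usage-based billing. It assumes the common serverless pricing shape of a per-request fee plus a charge per GB-second of run time. Both rates below are illustrative placeholders rather than quoted prices, free tiers are ignored, and the two scenarios are invented for comparison. It does not try to reproduce the $10,000 bill from the conversation.

```python
# ASSUMPTIONS: illustrative placeholder rates, not a quoted price list.
ASSUMED_PRICE_PER_GB_SECOND = 0.0000167
ASSUMED_PRICE_PER_MILLION_REQUESTS = 0.20


def monthly_cost(invocations, avg_seconds, memory_gb,
                 price_per_gb_second=ASSUMED_PRICE_PER_GB_SECOND,
                 price_per_million_requests=ASSUMED_PRICE_PER_MILLION_REQUESTS):
    """Estimate a month's bill for one function, ignoring any free tier."""
    # Compute charge: run time multiplied by the memory allocated.
    compute = invocations * avg_seconds * memory_gb * price_per_gb_second
    # Request charge: a flat fee per invocation.
    requests = invocations / 1_000_000 * price_per_million_requests
    return compute + requests


# Intended behavior: one 300-second run per day for 30 days.
normal = monthly_cost(invocations=30, avg_seconds=300, memory_gb=1.0)

# After a bad change: 300-second runs back to back, around the clock.
runs_per_month = 30 * 24 * 60 * 60 // 300
runaway = monthly_cost(invocations=runs_per_month, avg_seconds=300, memory_gb=1.0)

print("normal:  ${:.2f}".format(normal))
print("runaway: ${:.2f}".format(runaway))
print("multiplier: {:.0f}x".format(runaway / normal))
```

The dollar amounts depend entirely on the assumed rates and memory size, but the multiplier does not: the bill scales directly with how long and how often the code runs. Multiply that by hundreds of functions and the case for per-function cost visibility makes itself.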

Tune in for some of our future episodes, where we dive into each of these topics. We're gonna be doing that very soon, pulling apart the different ways that you all can better manage and optimize your cloud environments. Thank you all for listening.