Cloud Costs and Platform Engineering: Making the Right Thing the Default Thing

Series: The Modern SDLC · Post 16 of 17 · ← Post 15: Incident Management · Post 17: DORA Metrics →


Two problems compound each other in engineering organisations as they grow.

The first is cloud costs. Somewhere between ten engineers and fifty, cloud infrastructure spend stops being a line item someone notices occasionally and starts being a significant operating expense that finance asks about every month. The bill grows faster than the headcount. Nobody is sure exactly what's driving it. Teams provision conservatively because over-provisioning is safe and under-provisioning causes incidents. Savings opportunities are obvious in retrospect and invisible in prospect.

The second is duplicated platform work. Every product team builds their own CI pipeline, their own deployment scripts, their own observability setup, their own secrets management approach. Each implementation is slightly different, each is maintained by a different engineer, and each takes time that could go toward the product. The tenth time a team sets up Kubernetes RBAC from scratch, nobody is learning anything — they're just paying the same tax again.

These two problems are connected. A good Internal Developer Platform solves both simultaneously: it eliminates duplicated platform work across teams, and it makes the cost-efficient path the default path rather than a choice each team has to consciously make.

This post covers how to build both practices.


The one thing to remember

A platform team's job is to make other engineers faster. A FinOps practice's job is to make the cost of that speed visible. Together they ensure the organisation builds efficiently — not just quickly.


FinOps: cloud cost as an engineering discipline

FinOps is not a finance activity. It's an engineering practice. The team that incurs the cost should understand it, own it, and have the tools to optimise it. Centralising cloud cost management in a platform team or a finance function creates the wrong incentives — the people spending the money never see the bill, and the people seeing the bill don't know what created it.

The FinOps lifecycle has three phases, and most teams skip the first:

Inform — make costs visible. Tag every resource by team, service, and environment. Set up cost dashboards per team. Configure budget alerts. Enable anomaly detection. This phase is a prerequisite for everything else. You cannot optimise what you cannot see.

Optimise — reduce waste and improve efficiency. Rightsize over-provisioned resources. Purchase reserved capacity for stable workloads. Eliminate idle and orphaned resources. Turn off non-production environments overnight. This phase is where the savings happen.

Operate — embed cost awareness into the development process. Track unit economics. Include cost impact in architecture reviews. Surface cost deltas in IaC PRs via Infracost. Make cost a first-class engineering concern alongside performance and reliability.

Most teams jump to Optimise and wonder why the savings don't stick — because without Inform, the feedback loop doesn't exist. Engineers make cost-impacting decisions without knowing the impact, and the optimisations made in one quarter are undone by new provisioning in the next.


The biggest savings opportunities

Cloud bills at scale have a relatively consistent distribution of waste. The categories below represent where most of the money is, roughly in order of impact.

Reserved instances and savings plans: 30–60% discount. Usually the single largest saving available for steady workloads: you commit to one or three years of usage on a stable baseline in exchange for a significant discount off on-demand pricing. AWS, GCP, and Azure all have versions of this. The prerequisite is rightsizing, because committing to the wrong instance size locks in inefficiency for years. Rightsize first, then commit.
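Why rightsizing comes first can be made concrete with a little arithmetic. The sketch below assumes a deliberately simplified billing model: a commitment is charged at a flat discounted rate for every hour of the term whether or not the capacity is used, while on-demand is charged only for hours used. The function names and example figures are illustrative, not provider pricing.

```python
def break_even_utilisation(discount: float) -> float:
    """Fraction of committed hours that must actually be used before the
    commitment is no more expensive than on-demand for the same usage."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in the range [0, 1)")
    return 1.0 - discount


def commitment_saving(on_demand_hourly: float, discount: float,
                      utilisation: float, hours: int = 8760) -> float:
    """Saving versus on-demand over `hours`; negative means the commitment lost money.

    Assumed model: the commitment bills every hour at the discounted rate,
    on-demand bills only the fraction of hours actually used.
    """
    committed_cost = on_demand_hourly * (1.0 - discount) * hours
    on_demand_cost = on_demand_hourly * utilisation * hours
    return on_demand_cost - committed_cost
```

Under this model a 40% discount only pays off if the committed capacity is in use more than 60% of the time, which is why a commitment made against an over-provisioned baseline can end up costing more than on-demand would have.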

Spot and preemptible instances: 70–90% discount. Unused capacity sold at a deep discount in exchange for the possibility of interruption. Suitable for: batch jobs, CI/CD runners, ML training workloads, stateless workers with graceful shutdown handling. Not suitable for: databases, stateful services, anything with strict uptime requirements. Running CI on spot instances is one of the highest-ROI FinOps changes a team can make: CI workloads are batch, interruptible, and often represent significant compute spend.
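"Graceful shutdown handling" is the part that makes a worker safe to run on interruptible capacity. A minimal sketch follows, assuming the interruption reaches the process as SIGTERM (for example, because the node is being drained); `run_worker` and `handle` are hypothetical names, and a real worker would also return unfinished jobs to its queue.

```python
import signal
import threading

# Set when the platform asks the process to stop.
stop_requested = threading.Event()


def request_stop(signum, frame):
    """Signal handler: note the request and let the current job finish."""
    stop_requested.set()


def install_handlers():
    """Register the handler; call once from the main thread at startup."""
    signal.signal(signal.SIGTERM, request_stop)


def run_worker(jobs, handle):
    """Process jobs in order until none remain or a stop is requested.

    Returns the jobs that were not started, so the caller can requeue them.
    """
    remaining = list(jobs)
    while remaining and not stop_requested.is_set():
        handle(remaining.pop(0))
    return remaining
```

The design choice is to check for the stop request between jobs rather than abort mid-job, so each job either completes or is never started.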

Rightsizing over-provisioned resources: 20–40% reduction. Engineers provision conservatively, which is rational — over-provisioning is safe and under-provisioning causes incidents. The result is typical CPU utilisation of 10–20% on many instances. AWS Compute Optimizer, GCP Recommender, and tools like CAST AI analyse actual usage patterns and recommend appropriate instance sizes. Do this monthly, not once.
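The tools named above do this analysis with far more signal, but the core idea can be sketched in a few lines: take a high percentile of observed usage, add headroom, and pick the smallest size that covers it. The 95th percentile, the 30% headroom, and the list of sizes are all assumptions chosen for illustration.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def recommend_vcpus(cpu_utilisation, current_vcpus,
                    sizes=(2, 4, 8, 16, 32, 64), headroom=0.3):
    """Smallest size covering p95 usage plus headroom.

    cpu_utilisation: samples in the range 0..1, as a fraction of current_vcpus.
    Falls back to the largest size if nothing is big enough.
    """
    needed = p95(cpu_utilisation) * current_vcpus * (1 + headroom)
    for size in sizes:
        if size >= needed:
            return size
    return sizes[-1]
```

For an instance with 16 vCPUs that sits at 10% utilisation for 95% of samples, `recommend_vcpus` suggests 4 vCPUs. CPU alone is not enough for a real decision; memory, network, and burst behaviour need the same treatment.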

Eliminating idle and orphaned resources: 15–30% reduction. Stopped EC2 instances still charge for attached EBS volumes. Load balancers with nothing behind them accumulate hourly charges. Unused Elastic IPs, old snapshots, stale container images, forgotten RDS instances from projects that ended: these are invisible costs that accumulate over months. Automated cleanup tools (Cloud Custodian, or aws-nuke for non-production accounts) find and remove them systematically.
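The rule a cleanup tool applies is usually simple: unattached, and old enough that nobody is likely to be mid-migration. A sketch of that rule over a hypothetical inventory follows; the `Volume` record and the 14-day grace period are assumptions, and a real implementation would read the inventory from the cloud provider's API.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Volume:
    """Hypothetical inventory record for a block storage volume."""
    volume_id: str
    attached_to: Optional[str]  # instance ID, or None if unattached
    created: date
    monthly_cost: float


def orphaned(volumes: List[Volume], today: date, min_age_days: int = 14) -> List[Volume]:
    """Volumes that are unattached and older than the grace period."""
    return [
        v for v in volumes
        if v.attached_to is None and (today - v.created).days >= min_age_days
    ]


def monthly_waste(volumes: List[Volume]) -> float:
    """Total monthly cost of the given volumes."""
    return sum(v.monthly_cost for v in volumes)
```

Reporting the monthly cost of what was found, rather than just a count, tends to make the cleanup easier to prioritise.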

Non-production environment scheduling: roughly 65–75% of hours saved. A development or staging environment that runs 24/7 but is actively used for 8 hours on business days is busy for 40 of the week's 168 hours and idle for about 76% of its lifetime. Even a generous 12-hour weekday schedule removes about 64% of running hours. Schedule it off overnight and over weekends. AWS Instance Scheduler, Kubernetes CronJobs to scale deployments to zero, Karpenter's node consolidation to remove idle nodes: the mechanism is easy; the discipline of implementing it consistently across all non-production environments is where most teams fall short.
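The share of hours a schedule saves depends on the window chosen, and is quick to work out for any candidate schedule. A small sketch (the function name is illustrative):

```python
HOURS_PER_WEEK = 7 * 24  # 168


def idle_fraction(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of the week an environment could be switched off, given
    the hours per day and days per week it actually needs to run."""
    running = hours_per_day * days_per_week
    if not 0 <= running <= HOURS_PER_WEEK:
        raise ValueError("schedule must fit within one week")
    return 1 - running / HOURS_PER_WEEK


if __name__ == "__main__":
    for hours, days in [(8, 5), (12, 5)]:
        print(f"{hours}h x {days}d: {idle_fraction(hours, days):.0%} of hours saved")
```

An 8-hour weekday window leaves about 76% of hours to switch off; widening it to 12 hours to cover early starts and late finishes still leaves about 64%.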

Data transfer and egress optimisation: 10–25% reduction. Egress charges are invisible until they're enormous. Cross-AZ traffic within the same region, data replication to other regions, data leaving to the internet: all charged. Use VPC gateway endpoints for S3 and DynamoDB, which carry no data-processing charge, rather than sending that traffic through a NAT gateway; interface endpoints for other AWS services have their own hourly and per-GB charges, so compare the two before adopting them. Keep services that call each other frequently in the same availability zone to avoid cross-AZ charges on that traffic. Route static assets through a CDN rather than serving directly from origin.

Storage tiering: 5–15% reduction. Data written to S3 Standard that's never read again pays Standard pricing indefinitely. S3 Intelligent-Tiering moves objects between storage classes automatically based on access patterns. Lifecycle policies move logs, backups, and artefacts to cheaper tiers after a defined period; most are never accessed after 30 days, and storing them in Glacier costs a fraction of Standard pricing.
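A lifecycle policy of this kind is a small piece of configuration. The sketch below builds one as a Python dictionary in the shape that boto3's `put_bucket_lifecycle_configuration` accepts for its `LifecycleConfiguration` argument; the `logs/` prefix and the one-year expiry are assumptions for illustration, and the 30-day transition follows the figure above.

```python
def log_lifecycle_rules(prefix: str = "logs/",
                        archive_after_days: int = 30,
                        expire_after_days: int = 365) -> dict:
    """Lifecycle configuration: move objects under `prefix` to Glacier,
    then delete them once they reach `expire_after_days`."""
    if expire_after_days <= archive_after_days:
        raise ValueError("objects must be archived before they expire")
    return {
        "Rules": [
            {
                "ID": f"archive-{prefix.strip('/')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


# Applying it would look roughly like this (not executed here):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-bucket",  # hypothetical bucket name
#       LifecycleConfiguration=log_lifecycle_rules(),
#   )
```

In practice the same rule would normally live in the Terraform module for the bucket, so that every bucket created through the platform gets it by default.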


Visibility and accountability: the team that spends the money sees the bill

The structural principle of FinOps is that the team incurring cloud costs should see those costs and own them. Centralised cost visibility creates the wrong incentives and the wrong expertise in the wrong place.

Mandatory resource tagging is the prerequisite. Every resource is tagged with team, service, environment, and cost-centre, and the requirement is enforced via IaC policy: a Checkov rule that fails the plan if required tags are missing. Untagged resources are invisible waste; treat missing tags as a build failure the same way you treat a failing test.
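To show what such a policy checks, here is a standalone sketch that inspects the JSON form of a Terraform plan (the output of `terraform show -json` on a saved plan file). It is not a Checkov rule, only the same idea in plain Python. It assumes taggable resources expose a `tags` attribute in the plan's `resource_changes[].change.after` object, which holds for most AWS provider resources but should be verified for yours.

```python
REQUIRED_TAGS = ("team", "service", "environment", "cost-centre")


def missing_tags(plan: dict) -> dict:
    """Map each resource address to the required tags it lacks.

    Resources being deleted (after is null) and resource types with no
    `tags` attribute are skipped.
    """
    problems = {}
    for change in plan.get("resource_changes", []):
        after = (change.get("change") or {}).get("after")
        if after is None or "tags" not in after:
            continue
        tags = after.get("tags") or {}
        missing = [tag for tag in REQUIRED_TAGS if not tags.get(tag)]
        if missing:
            problems[change["address"]] = missing
    return problems
```

A CI step would load the plan JSON, call `missing_tags`, print the result, and exit non-zero if it is not empty, which is what turns a tagging convention into a build failure.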

Cost dashboards per team, showing monthly spend, trend, and budget status. Each team sets their own budget with engineering lead approval. Anomaly alerts go to the team Slack channel, not a centralised finance inbox. Engineers see the cost impact of their infrastructure decisions in the same context as their other operational metrics.

Unit economics are the metric that makes cost meaningful. Raw monthly spend is hard to reason about. Cost per API request, cost per active user, cost per transaction, cost per build — these connect infrastructure spend to business value. A rising cost per transaction while revenue is flat is a signal. A falling cost per user while scale grows is a success. Raw spend going up while unit economics improve is acceptable growth; raw spend going up while unit economics worsen is a problem.
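The distinction between acceptable growth and a problem reduces to comparing two periods on both raw spend and cost per unit. A minimal sketch, with hypothetical function names and labels:

```python
def unit_cost(spend: float, units: float) -> float:
    """Cost per unit of business value (request, user, transaction, build)."""
    if units <= 0:
        raise ValueError("units must be positive")
    return spend / units


def assess(prev_spend: float, prev_units: float,
           cur_spend: float, cur_units: float) -> str:
    """Classify a period-over-period change in spend."""
    previous = unit_cost(prev_spend, prev_units)
    current = unit_cost(cur_spend, cur_units)
    if cur_spend > prev_spend and current > previous:
        return "investigate"        # paying more per unit as well as overall
    if cur_spend > prev_spend:
        return "acceptable growth"  # spend is up, but each unit costs no more
    return "stable or improving"
```

For example, spend rising from 10,000 to 15,000 while requests double is acceptable growth, because cost per request fell; the same rise with flat traffic is worth investigating.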

Infracost in CI is the IaC equivalent of test coverage in code review. Every IaC PR gets a comment showing the monthly cost delta of the proposed change before it merges: "This PR will increase your AWS bill by $340/month." Engineers make cost-informed decisions at the point where they're making the change, rather than discovering the impact at end of month.


Kubernetes cost optimisation

Kubernetes clusters are typically the largest single infrastructure cost for teams running containerised workloads, and the optimisation levers are different from general cloud cost management.

Karpenter (AWS) replaces Cluster Autoscaler and is dramatically more efficient. The Cluster Autoscaler works with pre-defined node groups, scaling them up and down. Karpenter provisions individual nodes just-in-time for the pending pods, choosing the instance type that best matches the pod's resource requirements and constraints. The result is better bin-packing, more use of spot instances, and faster scale-out — typically 20–40% cost reduction compared to the Cluster Autoscaler.

Vertical Pod Autoscaler analyses actual CPU and memory usage and recommends (or automatically applies) request and limit adjustments. Engineers set resource requests once and move on; VPA keeps them accurate as usage patterns change. Run in recommendation mode first — observe the recommendations for a week before enabling auto-apply.

KEDA (Kubernetes Event-Driven Autoscaling) enables scaling to zero for workloads that don't need to run constantly. A queue consumer with an empty queue uses zero pods — and if Karpenter is managing nodes, zero pods for a workload means zero nodes for that workload. The cost saving for batch and async workloads can be substantial.
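The scaling rule for a queue consumer can be sketched as a single function. This is a simplification of queue-length-based scaling, not KEDA's actual implementation, which also involves the Horizontal Pod Autoscaler, polling intervals, and cooldown periods; the parameter names are illustrative.

```python
import math


def desired_replicas(queue_length: int, target_per_replica: int,
                     max_replicas: int) -> int:
    """Replicas needed so each handles at most `target_per_replica` messages.

    An empty queue needs no replicas at all, which is the scale-to-zero case.
    """
    if target_per_replica <= 0 or max_replicas < 0:
        raise ValueError("target_per_replica must be positive, max_replicas non-negative")
    if queue_length <= 0:
        return 0
    return min(max_replicas, math.ceil(queue_length / target_per_replica))
```

With a target of 10 messages per replica and a cap of 5, a queue of 25 messages asks for 3 replicas, a backlog of 1,000 is capped at 5, and an empty queue asks for none.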

OpenCost or Kubecost allocates cluster costs to namespaces, teams, and services. Shows cost per deployment, idle cost (reserved but unused capacity), and efficiency metrics. Essential for chargeback in multi-tenant clusters and for identifying which workloads are driving disproportionate cost.

Namespace resource quotas cap total CPU and memory requests per team namespace. Prevents one team's misconfigured deployment from consuming all available cluster capacity and triggering a node scale-out that affects every other team's workload.
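The check a quota performs is plain arithmetic over resource requests. The sketch below illustrates it for CPU only, handling the two common Kubernetes CPU notations (millicores such as `500m` and whole or fractional cores such as `2` or `0.5`). In a real cluster the API server enforces the ResourceQuota object; this is only the calculation behind it.

```python
from typing import Iterable


def cpu_millicores(quantity: str) -> int:
    """Parse a Kubernetes CPU quantity such as "500m", "2" or "0.5"."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def fits_quota(existing_requests: Iterable[str], new_request: str, quota: str) -> bool:
    """True if adding `new_request` keeps total CPU requests within `quota`."""
    used = sum(cpu_millicores(q) for q in existing_requests)
    return used + cpu_millicores(new_request) <= cpu_millicores(quota)
```

A namespace with a 2-core quota and 1.5 cores already requested can admit another 250m, but not another 750m.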


Platform engineering: the discipline of making other teams faster

Platform engineering is the practice of building and maintaining internal products that other engineering teams use to build, deploy, and operate their own services. The platform team's customers are internal engineers. Their product is developer experience.

The core insight: every product team building their own CI pipeline, their own Kubernetes manifests, their own secrets management, their own observability stack is duplicating effort that produces no product value. A platform team builds it once — well, with the benefit of focus and specialisation — and every team uses it. The time savings compound with team count.

At ten product teams, each spending two hours per week on platform maintenance, the platform team saves twenty engineer-hours per week. That's the equivalent of half an engineer's output, recovered every week, indefinitely. The platform team that produces this pays for itself.
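The arithmetic is worth making explicit, because it is the basis of the business case. The sketch assumes a 40-hour working week, which is what "half an engineer" implies; the function names are illustrative.

```python
def hours_recovered(teams: int, hours_per_team_per_week: float) -> float:
    """Weekly engineer-hours no longer spent on platform maintenance."""
    return teams * hours_per_team_per_week


def full_time_equivalents(hours_per_week: float, working_week: float = 40.0) -> float:
    """Convert weekly hours into full-time engineers."""
    return hours_per_week / working_week


if __name__ == "__main__":
    saved = hours_recovered(teams=10, hours_per_team_per_week=2)
    print(f"{saved:.0f} hours/week = {full_time_equivalents(saved):.1f} engineers")
```

Because the saving scales linearly with the number of teams, the same two hours per team is worth a full engineer at twenty teams.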

The platform team must think like a product team. This is the failure mode that kills platform initiatives: platform teams that build what they think product teams need rather than what product teams actually need. The platform has a roadmap. The platform has user research — interviews with internal engineers, quarterly DevEx surveys, friction log reviews. The platform has a product manager. Internal engineers are customers with preferences, constraints, and alternatives (they can build it themselves if the platform doesn't serve them). Treating them as customers produces platforms that get adopted. Treating them as beneficiaries of central mandates produces platforms that get worked around.

When to form a platform team: when you have three or more product teams, each spending meaningful time on infrastructure concerns rather than product work. A rough heuristic: if 20% or more of engineering time across product teams goes to non-product infrastructure tasks, a platform team likely pays for itself within a quarter.


The Internal Developer Platform

An Internal Developer Platform (IDP) is the collection of tools, services, and workflows the platform team provides. It abstracts complexity — engineers use the platform to deploy services, manage secrets, and get observability without needing to understand Kubernetes YAML, Terraform modules, or Vault policies.

The layers of a well-built IDP:

Developer portal — Backstage (or Port, Cortex). The service catalogue listing every service, its owner, its documentation, its SLOs, its runbooks, and its on-call schedule. Software templates for new services — scaffold a new microservice with repository structure, CI pipeline, Dockerfile, Helm chart, and Grafana dashboard in one click. TechDocs for internal documentation hosted alongside the code it documents. The front door of the platform.

Self-service deployment — reusable CI/CD pipelines. Reusable GitHub Actions workflows (via workflow_call) or GitLab CI templates that product teams call from their own repositories. The pipeline includes all quality gates — tests, security scanning, SBOM generation, image signing — without each team maintaining those gates themselves. Teams focus on their application code; the platform maintains the delivery pipeline.

Infrastructure provisioning — Terraform modules and Crossplane. Curated Terraform modules with opinionated defaults: encryption enabled, backups configured, monitoring included, deletion protection on by default. Crossplane extends this to Kubernetes-native infrastructure provisioning — engineers request a database by creating a Kubernetes custom resource, and the platform provisions and manages it. Teams consume infrastructure without needing to understand Terraform.

Observability — pre-configured by default. New services inherit a Grafana dashboard, default SLO alerts, and log routing automatically from the service template. Engineers don't configure observability; they inherit it and extend it. The platform ensures every service has the baseline coverage; teams add service-specific instrumentation on top.

Security guardrails — policy-enforced defaults. OPA or Kyverno admission controllers enforce security baselines in the Kubernetes cluster: non-root containers, resource limits required, image signing verified, NetworkPolicy required. Engineers can't accidentally deploy an insecure workload — the cluster rejects it. Security becomes the default path rather than an opt-in decision.
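What an admission policy evaluates can be illustrated with a plain function over a Pod manifest. This is a sketch of the logic only, not OPA or Kyverno policy syntax, and it covers just two of the baselines named above: non-root containers and required resource limits.

```python
def violations(pod: dict) -> list:
    """Return human-readable policy violations for a Pod manifest (as a dict)."""
    spec = pod.get("spec") or {}
    pod_non_root = (spec.get("securityContext") or {}).get("runAsNonRoot") is True
    found = []
    for container in spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        # A container-level securityContext overrides the pod-level setting.
        own = (container.get("securityContext") or {}).get("runAsNonRoot")
        non_root = own if own is not None else pod_non_root
        if non_root is not True:
            found.append(f"{name}: runAsNonRoot must be true")
        limits = (container.get("resources") or {}).get("limits") or {}
        for resource in ("cpu", "memory"):
            if resource not in limits:
                found.append(f"{name}: missing {resource} limit")
    return found
```

An admission controller would reject the workload whenever this list is non-empty and return the messages to the engineer, which is what makes the secure configuration the default rather than a review comment.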

Secrets management — abstracted and policy-enforced. Engineers request secrets through a defined process. External Secrets Operator syncs from Vault or AWS Secrets Manager into Kubernetes. Engineers never handle raw credentials, never manage secret rotation, never wonder whether a secret is current. The platform owns the secret lifecycle.


Golden paths: the opinionated, supported default

A golden path is the platform team's recommended, pre-built, well-supported way to do something common. It's not mandatory — engineers can deviate — but deviating means operating without platform support. The golden path is the path of least resistance to the right thing.

New service template — the most valuable golden path. A Backstage software template that scaffolds a new service in minutes: repository structure, CI pipeline, Dockerfile, Helm chart, Grafana dashboard, default SLO alerts, CONTRIBUTING.md, README. A new microservice has production-ready infrastructure within thirty minutes, with no platform knowledge required. Every service in the organisation starts from the same foundation rather than being rebuilt from tribal knowledge.

Deployment pipeline — a reusable workflow that implements the full CI/CD pipeline including build, test, security scanning, SBOM generation, image push, and GitOps manifest update. Teams call one workflow; the platform maintains it. When the security scanning tool changes, the platform updates the workflow once and every team gets the improvement.

Database provisioning — a Terraform module or Crossplane composite resource that provisions a database with platform defaults: encryption at rest, automated backups, deletion protection, monitoring, least-privilege IAM role. Teams provide a name and instance size; the platform handles everything else. Consistent, secure, and auditable across every database in the organisation.

Observability onboarding — adding two labels to a Kubernetes Deployment automatically provisions a Grafana dashboard, a Prometheus scrape config, default latency and error rate alerts, and log routing to Loki. Zero manual observability setup for teams adopting a new service.

The anti-pattern: golden paths so opinionated they can't accommodate real use cases, forcing teams to diverge constantly. A good golden path covers 80% of cases perfectly. The remaining 20% diverge with documented platform support — not a dead end. If the divergence rate is high, the golden path has gaps that need fixing.


Developer experience metrics: measure what you're trying to improve

Platform teams often have no metrics on whether their platform is making engineers faster. They build, they release, they assume adoption means value. DevEx metrics close this gap.

Time to first deploy — how long for a new engineer (or a new service) to reach production for the first time. The clearest single signal of platform friction. If it takes more than a day, the platform has a significant gap. Track it for every new hire. A downward trend means the platform is improving. An upward trend means it's getting harder to use despite investment.

CI/CD cycle time — time from commit to production. Directly measures the efficiency of the delivery pipeline. A rising trend signals pipeline degradation or process friction accumulation. Track per team and overall.

Developer satisfaction survey — quarterly, short. Three to five questions: "How much of your time goes to non-product work?" "How easy is it to deploy a change?" "What slows you down most?" Direct qualitative signal that quantitative metrics miss. Run it, share the results openly, and act on the top responses visibly — the visibility of action is what maintains survey participation over time.

Platform adoption rate — percentage of teams using the golden path versus custom pipelines. Percentage of services using the service template. Low adoption means the platform has friction, gaps, or trust problems. The right response to low adoption is user research, not mandates.

Toil ratio — percentage of engineering time spent on operational work versus product work. The platform team's job is to drive this down across all product teams. Target: less than 20% toil. Track via quarterly time surveys or ticket categorisation. A rising toil ratio is the signal that the platform isn't keeping pace with the organisation's operational needs.
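Most of these metrics are simple to compute once the raw data is collected. A sketch of three of them follows, assuming deployments are recorded as pairs of commit and production timestamps and that time surveys report hours; the function names are illustrative.

```python
from datetime import datetime, timedelta
from statistics import median


def median_cycle_time(deploys) -> timedelta:
    """Median commit-to-production time.

    deploys: iterable of (commit_time, production_time) datetime pairs.
    """
    seconds = [(prod - commit).total_seconds() for commit, prod in deploys]
    if not seconds:
        raise ValueError("need at least one deployment")
    return timedelta(seconds=median(seconds))


def toil_ratio(operational_hours: float, total_hours: float) -> float:
    """Share of engineering time spent on operational work."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return operational_hours / total_hours


def adoption_rate(teams_on_golden_path: int, total_teams: int) -> float:
    """Share of teams using the golden path rather than a custom pipeline."""
    if total_teams <= 0:
        raise ValueError("total_teams must be positive")
    return teams_on_golden_path / total_teams
```

The median is used for cycle time rather than the mean so that one deployment stuck over a weekend does not dominate the trend.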


What goes wrong when these practices are absent

The surprise bill. A misconfigured NAT gateway, a forgotten large instance, a runaway Lambda function generating millions of invocations — discovered at end of month when the bill arrives. With budget alerts and anomaly detection, these are caught within hours. Without them, they accumulate for weeks.

The platform tax. Every team spending 30% of their time on infrastructure concerns instead of product work. Across a ten-team organisation that is three whole teams' worth of capacity, permanently diverted from building the product; even if every team were a single engineer, it would still be three full-time engineers. A platform team that costs two engineers and recovers three or more engineers' worth of capacity across product teams is a net positive from day one.

The inconsistent platform. Twelve teams, twelve CI pipeline implementations, twelve approaches to secrets management, twelve sets of Kubernetes manifests with twelve different security configurations. An incident in one exposes a gap in all of them, but there's no mechanism to fix them consistently. Platform engineering eliminates this by making the platform a single maintained thing rather than a pattern copied imperfectly across teams.

The adopted-but-unused platform. A platform that product teams adopt because they're required to but work around because it doesn't serve their needs. The platform metrics show high adoption; the DevEx surveys show high friction. The platform team is building the wrong things because they're not treating engineers as customers.


If you do one thing from this post

Implement mandatory resource tagging across your cloud account this week. Pick four tags: team, service, environment, cost-centre. Add a Checkov rule that fails IaC plans with untagged resources. For existing untagged resources, use AWS Config or GCP Asset Inventory to find them and tag them retroactively.

Within thirty days, you'll have a cost breakdown by team and service that you can share in your next engineering all-hands. That breakdown is the starting point for every FinOps conversation that follows — "here's what we're spending, here's where it is, here's where we should look first" — and you can't have that conversation without it.


Next up: Post 17 — DORA Metrics: The Four Numbers That Tell You Whether Your Engineering Is Actually Getting Better
