Infrastructure as Code: Treat Your Cloud Like a Codebase

Series: The Modern SDLC · Post 9 of 17 ← Post 8: DevSecOps · Post 10: Containers and Kubernetes →

There's a class of production incident that's uniquely demoralising. The service goes down. The team investigates. The code is fine. The tests passed. The deployment worked. The problem is that someone changed a security group rule in the AWS console three weeks ago, and nobody documented it, and the change interacted with something else in a way nobody anticipated, and now the system is broken in a way that takes hours to trace back to a manual infrastructure change that nobody remembers making.

This is the snowflake problem. A snowflake server — or a snowflake cloud environment — is one that's been manually configured over time into a unique state that can't be reproduced, can't be audited, and can't be recovered reliably after failure. Every manual change is a divergence from any documented state. Every undocumented divergence is a potential incident.

Infrastructure as Code is the answer. Not because it eliminates all infrastructure problems, but because it turns infrastructure from something you configure and hope nobody changes into something you define, review, version, and automate — the same way you treat application code.

The one thing to remember

If it was clicked in a console, it doesn't exist. Infrastructure only exists when it's defined in code, reviewed like code, versioned like code, and applied automatically like code.

What IaC actually solves

The benefits of treating infrastructure as code compound over time in ways that aren't obvious upfront.

Reproducibility means spinning up an identical staging environment takes minutes, not days. Disaster recovery means running terraform apply against your last known good configuration, not trying to remember what was configured two years ago.

Auditability means every infrastructure change is a git commit with an author, a timestamp, a message, and a PR review trail. Compliance auditors asking "who changed this security group and when?" get an answer in seconds instead of a multi-day investigation.

Environment parity means dev, staging, and production are defined by the same code with different variable values. The "works in staging, fails in prod" class of problems caused by configuration drift becomes rare, because any remaining difference between environments is an explicit variable value you can read in a diff.

Drift prevention means the moment someone makes a manual change outside of IaC, it's detectable. Scheduled plan runs that produce non-empty diffs are alerts that the real world has diverged from the declared state. Drift is treated as a bug, not an acceptable operating condition.

Cost visibility means infrastructure changes have a cost delta before they're applied. Infracost, integrated into CI, posts the monthly cost impact of every IaC PR as a comment. "This PR will increase your AWS bill by $340/month" is information that should exist before a change is merged, not appear as a surprise at end of month.

Terraform: the default choice and why

Terraform is the most widely adopted IaC tool for a reason. The declarative model — you describe the desired state, Terraform calculates what needs to change — is intuitive. The provider ecosystem covers every major cloud and hundreds of services. The community knowledge base is deep. Terraform's failure modes are well-documented and the tooling around it is mature.

The killer feature is the plan/apply workflow.

terraform plan produces a diff of what will be created, modified, or destroyed before touching anything. Some attribute values are only known after apply, and the plan marks them as such, but the set of resources and the action on each one is explicit. This diff is reviewable in a PR the same way a code diff is. The team sees what will change and why, and can catch mistakes before they're in production.

terraform apply executes the plan. If the reviewed plan was saved with terraform plan -out and that saved file is what gets applied, the apply is a deliberate, documented action that does what was approved. If something unexpected happens at apply time, it's because the real world changed between plan and apply, which is itself a signal worth investigating.

A few Terraform practices worth getting right from the start:

Remote state with locking. Terraform tracks the real world in a state file. This file must never live in git — it contains secrets, it changes on every apply, and concurrent applies without locking corrupt it. Use S3 + DynamoDB for state locking on AWS, Terraform Cloud's managed state, or GitLab's built-in state management. This is non-negotiable from day one.
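As a rough sketch, the S3 + DynamoDB option could look like the following backend.tf for the prod environment. The bucket name, table name, region, and key path are hypothetical placeholders, and both the bucket and the table are assumed to exist already.

```hcl
# environments/prod/backend.tf
# All names below are illustrative assumptions, not values from this post.
terraform {
  backend "s3" {
    bucket  = "example-org-terraform-state"          # hypothetical state bucket
    key     = "environments/prod/terraform.tfstate"  # one key per environment
    region  = "eu-west-1"                            # assumed region
    encrypt = true                                   # encrypt state at rest

    # Hypothetical lock table. It needs a string partition key named LockID.
    dynamodb_table = "terraform-state-locks"
  }
}
```

Newer Terraform releases also offer S3-native lock files as an alternative to the DynamoDB table, so it is worth checking the S3 backend documentation for the version you run.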

Modules for reuse. Encapsulate common infrastructure patterns — a standard RDS instance with backups and monitoring, a VPC with standard subnet layout, an ECS service with autoscaling — into versioned modules. Teams consume the module by providing name and size; the module handles security groups, IAM roles, encryption, and monitoring by default. Copy-pasting 200 lines of HCL between services is not reuse — it's duplication that will drift.
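To illustrate, consuming a versioned module might look something like this. The repository URL, the v1.4.0 tag, and the input names are assumptions made up for the example; a real module defines its own inputs.

```hcl
# A service team asks for a database by name and size only.
module "orders_db" {
  # Hypothetical git repository holding the shared modules, pinned to a tag
  # so that module upgrades are deliberate.
  source = "git::https://example.com/platform/terraform-modules.git//rds-postgres?ref=v1.4.0"

  name           = "orders"         # assumed module input
  instance_class = "db.t4g.medium"  # assumed module input

  # Security groups, IAM roles, encryption, and monitoring are assumed to be
  # configured inside the module with sensible defaults.
}
```

Pinning the ref means a change to the module reaches each consumer only when that consumer bumps the version in a reviewed PR.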

prevent_destroy on stateful resources. Add lifecycle { prevent_destroy = true } to databases, S3 buckets, and other resources that would cause data loss if destroyed. A misconfigured resource name or an accidental terraform destroy shouldn't be able to drop a production database. The protection forces a deliberate decision — you have to remove the lifecycle block, plan, review the destroy, and apply — which is the right amount of friction for that action.
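A minimal sketch of what this could look like on a database resource follows. The identifier, sizes, and username are hypothetical, and most real definitions would carry more arguments.

```hcl
resource "aws_db_instance" "orders" {
  identifier        = "orders-prod"    # hypothetical name
  engine            = "postgres"
  instance_class    = "db.t4g.medium"
  allocated_storage = 50
  username          = "orders_admin"   # hypothetical

  # Let RDS generate and store the master password in Secrets Manager,
  # so no password appears in the configuration.
  manage_master_user_password = true

  lifecycle {
    # Any plan that would destroy this instance fails with an error.
    prevent_destroy = true
  }
}
```

One caveat from the Terraform documentation: if the whole resource block is deleted from the configuration, the lifecycle setting goes with it, so prevent_destroy guards against accidental plans rather than acting as a complete safeguard.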

Pin provider versions. An unpinned AWS provider can introduce breaking changes when Terraform initialises. Pin provider versions in required_providers and update them deliberately, not automatically.
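As an example, a pinned provider block could look roughly like this. The specific version numbers are placeholders chosen for illustration, not recommendations.

```hcl
terraform {
  # Placeholder minimum CLI version.
  required_version = ">= 1.6.0"

  required_providers {
    aws = {
      source = "hashicorp/aws"
      # Allows 5.40 and later 5.x releases, but never 6.0.
      version = "~> 5.40"
    }
  }
}
```

terraform init records the exact provider version it selected in .terraform.lock.hcl. Committing that file means every engineer and every CI run uses the same provider build until someone upgrades on purpose with terraform init -upgrade.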

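OpenTofu is worth evaluating for new projects. HashiCorp changed Terraform's licence from the Mozilla Public Licence to the Business Source Licence in August 2023, which restricts use in products that compete with HashiCorp's own. OpenTofu is the Linux Foundation fork that remains open-source under the Mozilla Public Licence 2.0. It was forked from the last MPL-licensed Terraform code and works with the same HCL configurations, providers, and modules in most setups, though the two projects have since added features independently. For teams starting fresh, it removes the licence uncertainty; just confirm that any newer Terraform feature you plan to rely on has an OpenTofu equivalent.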

Structure: how to organise IaC that scales

IaC that works at small scale often becomes painful at medium scale because the structure wasn't designed for growth. A structure that scales from a handful of services to dozens:

infra/
  modules/                    ← reusable, versioned building blocks
    ecs-service/
    rds-postgres/
    vpc-standard/
    s3-bucket-standard/
  environments/
    dev/
      main.tf                 ← calls modules with dev-sized values
      variables.tf
      terraform.tfvars        ← dev-specific values
      backend.tf              ← remote state configuration
    staging/
    prod/                     ← same structure, production-sized values
  shared/
    dns/                      ← resources shared across environments
    iam-roles/
    ecr-repositories/

The key principle: each environment directory has its own state file. This means a terraform apply in environments/dev cannot affect environments/prod. Changes promote through environments explicitly — the same module, different variable values — rather than through a single state that spans everything.

Keep prod infrastructure in a separate directory with a separate state file and stricter access controls. A misconfigured terraform apply in dev should be unable to touch prod. This is a structural guarantee, not a policy one.
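To make "same module, different variable values" concrete, here is one possible shape for the environment directories above. The variable names, module inputs, and instance sizes are assumptions for illustration.

```hcl
# environments/<env>/variables.tf  (identical in every environment)
variable "db_instance_class" {
  type        = string
  description = "RDS instance size for this environment"
}

variable "db_multi_az" {
  type        = bool
  description = "Whether the database runs in multiple availability zones"
}

# environments/<env>/main.tf  (identical in every environment)
module "orders_db" {
  source = "../../modules/rds-postgres"

  name           = "orders"               # assumed module input
  instance_class = var.db_instance_class  # assumed module input
  multi_az       = var.db_multi_az        # assumed module input
}

# environments/dev/terraform.tfvars
#   db_instance_class = "db.t4g.micro"
#   db_multi_az       = false
#
# environments/prod/terraform.tfvars
#   db_instance_class = "db.r6g.large"
#   db_multi_az       = true
```

With this layout, the only thing that differs between dev and prod is the tfvars file and the backend configuration, so a diff between the two directories shows exactly how the environments diverge.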

GitOps for infrastructure: git as the source of truth

GitOps extends IaC with a workflow: the desired state lives in git, and an automated process continuously reconciles the real world to match it. No manual terraform apply from a developer's laptop. No ad-hoc changes that bypass review.

The workflow:

1. Engineer opens a PR with an IaC change
2. CI runs terraform plan and posts the diff as a PR comment
3. Infracost posts the monthly cost delta as a PR comment
4. Security scanning (Checkov, tfsec) runs on the changed IaC
5. Reviewer approves, having reviewed the plan output and not just the code
6. PR merges
7. Automated tooling applies the plan to the target environment
8. Scheduled drift detection runs nightly and alerts on any divergence

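Atlantis is the most common open-source implementation of this workflow for Terraform. It runs as a server, listens for PR events, runs plans when a PR is opened or updated, and posts results as PR comments. By default the apply is triggered by an atlantis apply comment on the approved PR before it merges, so the main branch only ever contains changes that applied successfully; it can also be configured to merge the PR automatically after a successful apply. Self-hosted but straightforward to operate.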

Terraform Cloud (now branded HCP Terraform) and Scalr are hosted alternatives. Lower operational overhead than running Atlantis, with additional features like run history, cost estimation, RBAC, and policy enforcement (Sentinel, in Terraform Cloud's case).

The critical principle: nobody applies Terraform from their laptop with their own credentials. Every apply goes through the automated pipeline, is triggered by a git event, is logged, and produces an audit trail. An engineer who applies from their laptop bypasses review, can apply code that was never committed, and leaves a git history that doesn't reflect what's actually deployed.

Drift detection: treating manual changes as bugs

Once you've established IaC as the source of truth, manual changes become a reliability risk. Someone modifies a security group through the AWS console to debug an issue. Someone adds an S3 bucket manually to unblock themselves. These changes are recorded nowhere in git, unreviewable by the team, and may be reverted by, or conflict with, the next apply.

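Drift detection is the practice of running terraform plan on a schedule (nightly is common) and alerting if the plan is non-empty. The -detailed-exitcode flag makes this easy to script: exit code 0 means no changes, 1 means an error, and 2 means the plan contains changes. A non-empty plan means the real infrastructure has diverged from the IaC definition. That divergence is a bug that needs to be resolved: either update the Terraform configuration to match the manual change (if it was intentional and correct), or run terraform apply to restore the declared configuration (if the manual change was incorrect or temporary). Note that a plan only covers resources Terraform already manages; a bucket created by hand stays invisible until it is imported.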

The response to drift matters as much as detecting it. When drift is found, investigate why it happened before simply correcting it. Was the IaC process too slow for an operational need? Was the module missing a capability that forced a manual workaround? Was someone operating outside the process because they didn't know about it? Each cause has a different fix, and "someone made a manual change" is a symptom of a process gap, not a one-off event.

Policy as code: guardrails before provisioning

IaC lets you catch security misconfigurations at PR time — before a misconfigured S3 bucket or an overpermissive security group ever exists in the real world. Policy as code is the practice of encoding your security and compliance requirements as automated checks that run against IaC before it's applied.

Checkov is the most comprehensive tool for this. Over a thousand built-in rules covering Terraform, CloudFormation, Kubernetes manifests, and Dockerfiles. Common catches: S3 buckets with public access enabled, security groups open to the world on port 22 or 3389, RDS instances without encryption, IAM policies with wildcard permissions, Lambda functions with excessive privileges.

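tfsec and Trivy are faster alternatives with a smaller but more focused rule set. Trivy now covers IaC scanning alongside container scanning, and tfsec's maintainers are directing new development into Trivy, so Trivy is the safer pick for new setups. Good for teams that find Checkov's output volume overwhelming.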

OPA (Open Policy Agent) with Rego is for teams that need custom policies beyond what built-in rules cover. Write your own rules: "all S3 buckets must have a team tag," "no RDS instances in production may have deletion protection disabled," "EC2 instances may only use approved AMIs." These are your organisation's specific requirements, codified and automatically enforced.

Infracost deserves a mention in this section. It's not a security tool but it's a policy tool: a cost policy that blocks PRs which would increase infrastructure spend by more than a defined threshold without explicit approval. "This PR will increase your monthly bill by $1,200" is information that should trigger a conversation, not silently land in next month's invoice.

Testing IaC: it's code, treat it like code

Untested IaC modules get copy-pasted, drift from their original intent, and fail in unexpected ways when applied to a new environment. The same discipline that applies to application code applies here.

terraform validate and terraform fmt -check are the minimum — run in CI on every PR. They catch syntax errors and formatting drift before anything more expensive runs.

terraform plan in CI goes further: it runs a real plan against real infrastructure (or a test environment), showing what the module would actually do. Reviewing the plan output is as important as reviewing the code.

Terratest (Go-based) and terraform test (the native testing framework introduced in Terraform v1.6) let you write tests that apply real Terraform, make assertions against the resulting infrastructure, and destroy everything afterward. Slow, but the right tool for shared modules that many teams depend on — a breaking change to a shared module is an incident for everyone who uses it.

Plan-based assertions are a faster middle ground: parse terraform plan -json in CI to assert on planned changes without creating real resources. "This module must create exactly one RDS instance." "This module must not create any publicly accessible resources." Fast, cheap, and catches a class of mistakes that review alone misses.
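For the native terraform test framework mentioned above, a test file might look roughly like the sketch below. It assumes a module that takes a name input and declares resources called aws_s3_bucket.this and aws_s3_bucket_versioning.this; those names are hypothetical. The first run block uses plan mode, which gives fast feedback similar to plan-based assertions without creating anything, and the second applies for real.

```hcl
# modules/s3-bucket-standard/tests/defaults.tftest.hcl
# Assumes the module has a "name" input and the resource names used below.

variables {
  name = "tftest-example-bucket"
}

# Plan only: fast, creates nothing.
run "plan_enables_versioning" {
  command = plan

  assert {
    condition     = aws_s3_bucket_versioning.this.versioning_configuration[0].status == "Enabled"
    error_message = "The standard bucket module should enable versioning by default."
  }
}

# Full apply: creates a real bucket, which terraform test destroys afterwards.
run "apply_creates_named_bucket" {
  command = apply

  assert {
    condition     = aws_s3_bucket.this.bucket == "tftest-example-bucket"
    error_message = "The bucket name should match the name input."
  }
}
```

Running terraform test from the module directory picks up files ending in .tftest.hcl. The apply run needs real cloud credentials, so it usually belongs in a dedicated test account rather than anywhere near production.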

What goes wrong when you skip IaC

The snowflake environment. Every environment is unique. Staging works differently from production in ways nobody fully understands. Incidents in production can't be reproduced in staging because the environments aren't the same. Disaster recovery is a multi-day exercise in remembering what was configured.

The audit failure. A compliance audit requires demonstrating who changed what infrastructure and when. Without IaC, the answer is a multi-week manual investigation of cloud provider logs, if they've been retained. With IaC, it's a git log.

The state file in git. Terraform state committed to git is a security incident waiting to happen. State files contain sensitive values: database passwords, API keys, private IPs. Git also provides no locking, so two engineers applying at the same time each write their own version of the state, and one of them ends up with a file that no longer matches reality. Remote state with locking is non-negotiable; this is the most common Terraform mistake and the most consequential.

The manual console change that overwrites IaC. An engineer makes a manual change to fix an incident. Two weeks later, someone runs terraform apply for an unrelated change. Terraform's plan shows the manual change being reverted — but nobody notices because they're not reading the plan carefully. The system breaks again. The root cause takes days to find.

The monolithic state file. All infrastructure in a single state file means every terraform apply holds the lock for everything, so unrelated changes queue behind each other, every plan takes minutes because it must refresh all resources, and any state corruption affects everything. Split state by environment and by service from the start.

If you do one thing from this post

Find one piece of infrastructure in your system that was created manually and define it in Terraform. Not the most complex thing — the simplest. An S3 bucket. A security group. A DNS record.

Write the resource definition first, since terraform import needs a resource block to attach to. Then import the existing resource into state with terraform import, run terraform plan, and adjust the definition until the plan shows no changes. You've just brought one piece of infrastructure under version control.
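On Terraform 1.5 and later, the import itself can also be declared in code, which means it goes through the same PR review as everything else. A minimal sketch for an S3 bucket, with a made-up bucket name and resource label:

```hcl
# Declares that an existing bucket should be adopted into state.
# "example-legacy-uploads" is a hypothetical bucket name.
import {
  to = aws_s3_bucket.legacy_uploads
  id = "example-legacy-uploads"  # for S3 buckets, the import ID is the bucket name
}

# The definition the imported bucket is attached to.
resource "aws_s3_bucket" "legacy_uploads" {
  bucket = "example-legacy-uploads"
}
```

With this approach the first plan should report one resource to import and nothing to add, change, or destroy. Once it has been applied, the import block can be removed and later plans should come back empty.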

That's the starting point. Migrate one resource at a time. Eventually — and it usually takes less time than teams expect — everything is managed as code, everything is reviewable, and the console becomes a read-only dashboard rather than a place where changes happen.

Next up: Post 10 — Containers and Kubernetes: What They Actually Are and When You Actually Need Them

← Post 8: Shift Left: How to Make Security Every Engineer's Job
