GitOps: Making Deployment So Boring It Never Wakes You Up at 3am

Series: The Modern SDLC · Post 11 of 17 ← Post 10: Containers and Kubernetes · Post 12: Release Management →
The best deployment you ever do is the one nobody notices.
No last-minute testing panic. No all-hands deployment call. No Slack message asking "is anyone else seeing errors?" It happens automatically, it's over in minutes, and the only signal is a deployment marker appearing on your monitoring dashboard and a green notification in the team channel. By the time anyone checks, it's done.
This is what mature continuous delivery looks like. Not a dramatic event — a boring, automated consequence of merging code.
Most teams are nowhere near this. Deployments are manual, infrequent, and high-stakes. They happen on specific days, at specific times, after specific sign-offs, because the accumulated weight of each release makes deploying feel risky. The irony is that infrequent deployment makes risk higher, not lower — large batches of changes are harder to debug, harder to roll back, and more likely to interact in unexpected ways.
Continuous delivery and GitOps are the practices that break this cycle. This post explains how they work and how to build toward them.
The one thing to remember
Deployment stops being a manual, high-anxiety event and becomes a boring, automated consequence of merging code. The goal is not faster deployment — it's deployment that's so reliable it stops being worth thinking about.
CD, CI, and continuous deployment — the distinction that matters
These three terms get used interchangeably and they shouldn't. The distinctions matter because they describe different levels of automation with different requirements.
Continuous integration is what Post 7 covered: merge code frequently, build and test automatically on every push. The output is a verified, versioned artefact.
Continuous delivery means the artefact is always in a deployable state, and the deployment pipeline is automated — but a human may still approve the final step before production. The pipeline does the work; a person makes the call.
Continuous deployment means every passing build deploys to production automatically with no human gate. The most advanced form. Requires high test confidence, mature observability, and the cultural trust to let automation make production decisions.
The right target for most teams is continuous delivery — automated pipeline, optional human gate to production — and graduating to continuous deployment as confidence in the pipeline grows. Jumping straight to continuous deployment before your test suite has earned that trust is not boldness; it's skipping the foundation.
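To make the distinction concrete, here is a minimal sketch of what happens to a build under each model. The `Mode` enum, the `release_decision` function, and the returned strings are hypothetical names chosen for illustration; no real CI system exposes this API.

```python
from enum import Enum


class Mode(Enum):
    CONTINUOUS_INTEGRATION = "ci"
    CONTINUOUS_DELIVERY = "delivery"
    CONTINUOUS_DEPLOYMENT = "deployment"


def release_decision(mode: Mode, build_passed: bool, human_approved: bool = False) -> str:
    """Describe the outcome for one build under the given model (illustrative only)."""
    if not build_passed:
        # A failing build goes nowhere in any of the three models.
        return "rejected"
    if mode is Mode.CONTINUOUS_INTEGRATION:
        # CI stops at a verified, versioned artefact.
        return "artefact published"
    if mode is Mode.CONTINUOUS_DELIVERY:
        # The pipeline does the work; a person makes the final call.
        return "deployed to production" if human_approved else "awaiting approval"
    # Continuous deployment: no human gate.
    return "deployed to production"
```

The only difference between the last two branches is the `human_approved` check, which is why graduating from continuous delivery to continuous deployment is mostly a question of trust in the test suite rather than of tooling.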
The GitOps model: git as the source of truth for everything
GitOps is the implementation pattern that makes continuous delivery reliable and auditable. The principle: the desired state of your entire system — application versions, configuration, Kubernetes manifests — lives in git. An automated operator continuously reconciles the real world to match what's in git. Nothing is deployed except through a git event.
The workflow looks like this:
1. Developer merges a PR, and code lands on main
2. CI builds, tests, and pushes a versioned container image
3. CI opens a PR in the deployment repository updating the image tag
4. The PR is reviewed and merged (or auto-merged, if the team's policy allows that for images built from main)
5. ArgoCD or Flux detects the manifest change in the deployment repo
6. The operator reconciles, deploying the new version to the cluster
7. Health checks verify the deployment
8. A notification appears in the team channel
The developer's interaction ends at step 1. Everything after is automated, apart from the optional review in step 4.
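The image-tag update that CI proposes to the deployment repository can be sketched as a small text rewrite. This is only an illustration: the manifest, the `registry.example.com/checkout` image, and the `set_image_tag` helper are all made up, and a real pipeline would more likely use a YAML-aware tool or its GitOps operator's image automation than a regular expression.

```python
import re

# Hypothetical manifest as it might sit in the deployment repository.
MANIFEST = """\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
spec:
  template:
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout:1.4.2
"""


def set_image_tag(manifest: str, image: str, new_tag: str) -> str:
    """Return the manifest text with the tag of `image` replaced by `new_tag`."""
    pattern = re.compile(r"(image:\s*" + re.escape(image) + r"):[\w.\-]+")
    # A function replacement avoids backslash surprises in the new tag.
    updated, count = pattern.subn(lambda m: f"{m.group(1)}:{new_tag}", manifest)
    if count == 0:
        # Failing loudly beats opening a PR that changes nothing.
        raise ValueError(f"image {image!r} not found in manifest")
    return updated


if __name__ == "__main__":
    print(set_image_tag(MANIFEST, "registry.example.com/checkout", "1.5.0"))
```

The resulting diff is a single line, which is what keeps the deployment PR trivial to review or safe to auto-merge.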
ArgoCD is the most widely adopted GitOps operator. It provides a web UI showing the current state of every application, the sync status between git and the cluster, and the history of all deployments. It supports automatic sync (apply immediately when git changes) and manual sync (require a human trigger). Rollback to any previous revision is a single CLI command or UI click.
Flux is the lighter-weight alternative — CLI-first, more composable, no built-in UI (though Weave GitOps adds one). Better for teams that want GitOps as an infrastructure primitive rather than a platform with opinions. The two tools solve the same problem and the choice usually comes down to whether you want a rich UI or prefer to stay in the terminal.
The app repo vs config repo separation is the practice that makes GitOps clean. Application source code lives in one repository. Kubernetes manifests and Helm values — the deployment configuration — live in a separate repository. CI writes to the config repo by updating image tags; the GitOps operator reads from it. This separation means the operator doesn't need access to source code, deployment history is visible in one place regardless of which service changed, and you can update configuration without triggering a code build.
Push-based vs pull-based CD: why pull wins
Most traditional CI/CD pipelines are push-based: the CI system, after building and testing, pushes a deployment to the target environment. It calls kubectl apply or helm upgrade or runs a deployment script directly.
This model has a structural problem. The CI system needs credentials for every environment it deploys to. Those credentials are stored in the CI system, which is a high-value target. A compromised CI system means a compromised deployment path to production. Additionally, there's no ongoing reconciliation — if someone manually changes the cluster state after deployment, the CI system doesn't know and won't correct it.
GitOps is pull-based. The operator runs inside the cluster and pulls desired state from git. It doesn't need to be reachable from outside. It doesn't expose credentials to the CI system. And because it continuously reconciles, drift from the declared state is detected and corrected automatically.
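The reconciliation idea can be sketched in a few lines. This is a toy model, not how ArgoCD or Flux are implemented: state is reduced to a hypothetical mapping of application name to image tag, and `plan` and `reconcile` are illustrative names. Real operators also tend to make deletion of undeclared resources an opt-in setting, whereas this sketch always prunes.

```python
from typing import Dict, List, Tuple

State = Dict[str, str]  # application name -> image tag (a deliberate simplification)


def plan(desired: State, actual: State) -> List[Tuple[str, str, str]]:
    """Compute the actions needed to make `actual` match `desired`."""
    actions = []
    for app, tag in sorted(desired.items()):
        if app not in actual:
            actions.append(("create", app, tag))
        elif actual[app] != tag:
            actions.append(("update", app, tag))
    # Anything running that git does not declare is drift.
    for app in sorted(actual.keys() - desired.keys()):
        actions.append(("delete", app, actual[app]))
    return actions


def reconcile(desired: State, actual: State) -> State:
    """Apply the plan to a copy of `actual` and return the resulting state."""
    result = dict(actual)
    for action, app, tag in plan(desired, actual):
        if action == "delete":
            del result[app]
        else:
            result[app] = tag
    return result
```

Run on a loop, the same function handles both deployment (git changed) and drift correction (the cluster changed), which is the property a push-based pipeline lacks.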
The security model is meaningfully better: your CI system needs credentials to push to a container registry and to update a git repository. It doesn't need credentials to access your Kubernetes cluster at all.
Progressive delivery: limiting the blast radius
Even with a fully automated pipeline, not every deployment should go to 100% of traffic immediately. Progressive delivery is the practice of gradually exposing a new version — with automated or manual gates at each step — so that if something goes wrong, only a fraction of users are affected.
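Rolling updates are the Kubernetes default. Pods are replaced incrementally: by default a Deployment may surge up to 25% above the desired replica count and take up to 25% of pods unavailable at a time. New pods must pass readiness checks before they receive traffic and before the rollout continues. Both versions run simultaneously during the rollout. With correctly configured readiness probes this gives zero downtime, but no control over traffic split and no automatic rollback based on metrics.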
Blue-green deployments run two identical environments: blue is live, green receives the new version. Traffic switches at the load balancer, which is close to instant, or at the DNS level, where cached records can delay the cutover until TTLs expire. Blue stays running as an immediate rollback target. The cost: double the infrastructure during the transition. The benefit: complete separation between old and new, fast rollback, and full testing of the new environment before any traffic hits it.
Canary releases route a small percentage of traffic to the new version — 1%, 5%, 25% — and monitor error rates and latency. If metrics look good, increase the percentage. If something looks wrong, roll back before the majority of users are affected. This is real production traffic validating the release, with limited blast radius.
Argo Rollouts automates progressive delivery. You define an analysis template — "the error rate of the canary must stay below 1% and p95 latency must stay below 400ms" — and Argo Rollouts increases canary weight automatically when analysis passes and rolls back automatically when it fails. The human sets the policy; the automation executes it.
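The policy-then-automation split can be sketched as a pure function. This is not Argo Rollouts' API or its AnalysisTemplate format; `Policy`, `Metrics`, and `next_weight` are hypothetical names, and the thresholds simply reuse the 1% error rate and 400ms p95 from the example above, with weight steps borrowed from the canary percentages mentioned earlier.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Policy:
    max_error_rate: float = 0.01              # canary error rate must stay below 1%
    max_p95_ms: float = 400.0                 # canary p95 latency must stay below 400ms
    steps: Tuple[int, ...] = (1, 5, 25, 100)  # traffic weights, in percent


@dataclass(frozen=True)
class Metrics:
    error_rate: float
    p95_ms: float


def next_weight(current: int, metrics: Metrics, policy: Policy = Policy()) -> int:
    """Return the next canary weight in percent; 0 means roll back."""
    if metrics.error_rate >= policy.max_error_rate or metrics.p95_ms >= policy.max_p95_ms:
        return 0  # analysis failed: send all traffic back to the stable version
    for step in policy.steps:
        if step > current:
            return step  # analysis passed: promote to the next step
    return current  # already at full traffic
```

For example, `next_weight(5, Metrics(error_rate=0.002, p95_ms=310.0))` promotes the canary to 25%, while the same call with an error rate of 3% returns 0.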
Feature flags as a delivery mechanism are worth separating from the feature flag discussion in Post 5 because of what they enable for deployments. Code ships to 100% of pods behind a flag that's off. The feature is released when the flag is turned on, independently of the deployment. Rollback is turning the flag off — sub-second, no redeployment. The deployment and the release are completely decoupled events.
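A minimal sketch of that decoupling, assuming an in-memory flag store: `FlagStore`, the `new-discount-engine` flag, and `checkout_total` are invented for illustration, and a production system would read flags from a flag service rather than a dictionary.

```python
from typing import Dict, List


class FlagStore:
    """Toy in-memory flag store; unknown flags default to off."""

    def __init__(self) -> None:
        self._flags: Dict[str, bool] = {}

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)


def checkout_total(prices_in_cents: List[int], flags: FlagStore) -> int:
    """Both code paths are deployed; the flag decides which one is released."""
    subtotal = sum(prices_in_cents)
    if flags.is_enabled("new-discount-engine"):
        return subtotal * 90 // 100  # new behaviour: hypothetical 10% discount
    return subtotal                  # old behaviour stays the default
```

Calling `flags.set("new-discount-engine", False)` is the rollback: the new code is still on every pod, but no request reaches it.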
The expand/contract pattern: zero-downtime database changes
Application rollback is easy. Database schema rollback is hard. This is the problem that makes many teams nervous about frequent deployment, and it has a well-established solution that most teams haven't adopted.
The naive approach: add a column, deploy the new application version that uses it. The problem: during a rolling deploy, old application pods are running alongside new ones. If the old application version doesn't understand the new column, you have a compatibility problem.
The expand/contract pattern solves this with three sequential deployments instead of one:
Expand: Add the new column as nullable, with no constraints. The old application version runs fine — it ignores the column it doesn't know about. No incompatibility.
Migrate: Deploy the new application version that uses the new column. Backfill existing rows. Both versions are now compatible with the schema.
Contract: Add constraints, remove any old column that the new one replaces, and clean up. Only run this deploy after the previous one has been stable long enough that you're confident you won't roll back.
Three deployments instead of one, but each is safe to roll back independently. The pattern generalises: any schema change can be decomposed into additive (backward-compatible) steps followed by cleanup steps. The rule is that every schema migration must be backward-compatible with the previous application version — if it is, rollback is always safe.
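The three steps can be walked through with Python's built-in `sqlite3` module. The `users` table, the `full_name` and `display_name` columns, and the two insert functions standing in for the old and new application versions are all assumptions made for this sketch; the SQL for your database and migration tool will differ.

```python
import sqlite3


def old_app_insert(db: sqlite3.Connection, name: str) -> None:
    """Previous application version: only knows about full_name."""
    db.execute("INSERT INTO users (full_name) VALUES (?)", (name,))


def new_app_insert(db: sqlite3.Connection, name: str) -> None:
    """New version: writes both columns, so rolling back to the old one stays safe."""
    db.execute("INSERT INTO users (full_name, display_name) VALUES (?, ?)", (name, name))


def expand(db: sqlite3.Connection) -> None:
    # Additive and nullable: inserts from the old version keep working.
    db.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def backfill(db: sqlite3.Connection) -> None:
    # Part of the migrate step: fill rows written before or during the rollout.
    db.execute("UPDATE users SET display_name = full_name WHERE display_name IS NULL")


def contract(db: sqlite3.Connection) -> None:
    # Destructive: run only once rollback to the old version is off the table.
    # DROP COLUMN needs SQLite 3.35 or newer.
    db.execute("ALTER TABLE users DROP COLUMN full_name")


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)")
old_app_insert(db, "Ada")
expand(db)                    # deployment 1
old_app_insert(db, "Grace")   # old pods are still serving during the rollout
new_app_insert(db, "Linus")   # deployment 2: new pods
backfill(db)
```

Note that `contract` is defined but deliberately not called: after it runs, `old_app_insert` would fail, which is exactly why that step waits until rollback is no longer on the table.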
Rollback: it must be instant, tested, and boring
A rollback procedure that has never been tested, takes thirty minutes to execute, and requires someone who knows the system is not a rollback procedure — it's a hope. Rollback should be a one-command operation that any engineer on the team can execute without help.
GitOps rollback is the cleanest: revert the commit in the config repository that updated the image tag. The GitOps operator detects the change and deploys the previous version. Full audit trail. Same process as a forward deployment.
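ArgoCD rollback keeps a history of all deployed revisions. Roll back to any previous revision from the UI or with argocd app rollback my-app 3, where 3 is the history ID of the revision to restore. No git operation is required, but the git history won't reflect the rollback, so following up with a git revert is good practice. ArgoCD also refuses to roll back an application that has automated sync enabled, since the next sync would immediately reapply what is in git; disable auto-sync first or use the git revert path instead.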
Feature flag rollback is the fastest: turning a flag off takes milliseconds and requires no deployment at all. For any high-risk feature, a feature flag kill-switch should be the first response to a production problem.
The most important rollback practice: define before you deploy what would trigger it. "If p95 latency stays above 500ms for more than five minutes after deploying, roll back" is a clear, pre-committed decision. "We'll see how it looks" is a decision made under pressure with impaired judgment. Pre-committing the rollback trigger removes the political difficulty of admitting a deployment went wrong.
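A pre-committed trigger like the one above is simple enough to write down as code before the deploy. The sketch below assumes p95 latency arrives as a time-ordered list of `(seconds_since_deploy, p95_ms)` samples; `should_roll_back` and its parameter names are hypothetical, and in practice this rule would usually live in an alerting or rollout tool rather than a script.

```python
from typing import Iterable, Optional, Tuple


def should_roll_back(
    samples: Iterable[Tuple[float, float]],
    threshold_ms: float = 500.0,
    sustained_s: float = 300.0,
) -> bool:
    """True once p95 has stayed above threshold_ms for more than sustained_s seconds.

    `samples` are (seconds_since_deploy, p95_ms) pairs in time order.
    """
    breach_started: Optional[float] = None
    for t, p95 in samples:
        if p95 > threshold_ms:
            if breach_started is None:
                breach_started = t  # start of a continuous breach
            if t - breach_started > sustained_s:
                return True
        else:
            breach_started = None   # a healthy sample resets the clock
    return False
```

Resetting on every healthy sample is one possible design choice: it ignores short spikes, at the cost of missing a service that flaps around the threshold. Whichever rule you pick, the point is that it was chosen before the deploy, not during the incident.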
DORA metrics: the scoreboard for delivery performance
The DevOps Research and Assessment programme identified four metrics that reliably predict organisational performance. They're the closest thing the industry has to an objective measure of delivery health.
Deployment frequency — how often you deploy to production. Elite teams deploy multiple times per day. This metric is a forcing function: achieving high frequency requires small PRs, trunk-based development, automated pipelines, and good test coverage. Low frequency is almost always a symptom of multiple other problems.
Lead time for changes — time from commit to running in production. Elite teams measure this in under an hour. Long lead time means slow feedback on whether a change works in production, slow response to bugs, and slow delivery of value to users.
Change failure rate — the percentage of deployments that cause a production incident requiring rollback or hotfix. Elite teams stay under 5%. A rising change failure rate while deploy frequency increases is the signal that your quality gates aren't keeping pace with your delivery speed.
Mean time to restore (MTTR) — how long to restore service after a production incident. Elite teams recover in under an hour. This measures the combination of observability maturity (detect quickly), runbook quality (respond effectively), and rollback capability (mitigate fast).
The 2021 DORA report expanded an earlier availability measure into a fifth metric: reliability, meaning whether teams consistently meet their SLO targets. This was added specifically because high deploy frequency without maintained reliability is not elite performance.
The critical misuse to avoid: using these as management KPIs to compare teams. DORA metrics are a health signal for a team about itself over time, not a ranking system. Teams that are measured on deploy frequency inflate it with trivial commits. Teams that are measured on change failure rate become risk-averse and ship less. Use them as a diagnostic tool, not a scorecard.
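In that diagnostic spirit, a team can compute the four core metrics for itself from its own deployment log. The sketch below assumes a hypothetical `Deployment` record with commit, deploy, and restore timestamps; the field names and the choice of median for lead time are illustrative, not DORA's official methodology.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass(frozen=True)
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False                    # caused an incident needing rollback or hotfix
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_summary(deploys: List[Deployment], window_days: int) -> dict:
    """Summarise one team's delivery metrics over a window of `window_days` days."""
    if not deploys:
        raise ValueError("no deployments in window")
    failures = [d for d in deploys if d.failed]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restore_times) if restore_times else None
        ),
    }
```

Tracking these numbers for the same team across successive windows shows the trend; comparing one team's numbers with another's invites the gaming described above.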
What goes wrong when CD is broken
The deployment event. Infrequent, high-ceremony, high-stakes deploys. Every deployment is a production incident waiting to happen because every deployment is a large batch of accumulated changes being released simultaneously. The solution is smaller, more frequent deployments — not better runbooks for large deployments.
Push-based CI with cluster credentials. The CI system has kubectl access to production. A compromised CI pipeline (through a malicious dependency, a leaked secret, or a supply chain attack) is a compromised production environment. GitOps's pull-based model takes cluster credentials out of CI entirely, which closes this path; write access to the config repository then becomes the thing to protect, with branch protection and review.
No drift detection. Manual changes to the cluster accumulate silently. The next automated deployment conflicts with the manual changes in unpredictable ways. Debugging takes hours because nobody knows what state the cluster was actually in before the deployment.
Rollback that's never been tested. The first time a team discovers that rollback takes forty-five minutes and requires tribal knowledge is during a production incident. Test rollback procedures in game days before they're needed in emergencies.
DORA metrics as a ranking system. Engineering managers comparing teams on deployment frequency creates gaming — trivial deployments to inflate the number — and destroys the signal. The metrics exist to help teams improve, not to rank them.
If you do one thing from this post
Separate your application repository from your deployment configuration repository if they're currently the same. Create a config repo that holds your Kubernetes manifests or Helm values. Update your CI pipeline to open a PR against the config repo when a new image is built, rather than deploying directly.
This one change does three things: it makes deployment history visible in one place, it enables GitOps tooling like ArgoCD or Flux, and it removes the need for your CI system to have cluster credentials. The operational improvement arrives before you've fully adopted GitOps — and the foundation is there when you're ready.
Next up: Post 12 — Release Management: How to Ship Without Fear
← Post 10: Containers and Kubernetes: What They Actually Are and When You Actually Need Them




