How to Build a CI Pipeline That Engineers Actually Trust

Series: The Modern SDLC · Post 7 of 17
A CI pipeline is the nervous system of your delivery process. Every change flows through it. Every engineer depends on it. And like a nervous system, you only notice it when something is wrong.
The difference between a CI pipeline engineers trust and one they don't isn't usually the tools. It's the signal quality. A trusted pipeline is one where a green build means something — where passing CI is genuine evidence that the code works, not a formality to get through before merging. An untrusted pipeline is one where engineers have learned to merge despite red builds because the failures are usually noise, or to wait forty-five minutes for a result that doesn't tell them much.
Once a team learns to ignore CI, the pipeline stops being a quality gate and becomes a delay. Rebuilding that trust after it's lost is harder than building it correctly in the first place.
This post covers how to build a CI pipeline that earns and keeps that trust — fast, reliable, and genuinely informative.
The one thing to remember
A CI pipeline has one job: give every engineer fast, accurate feedback on every change. Fast means under ten minutes for the blocking gate. Accurate means green genuinely means good and red genuinely means broken.
If your pipeline fails either of those, you have work to do regardless of how sophisticated the tooling is.
Pick a platform and stop maintaining your own
The first decision is where your CI runs. The answer for most teams in 2024 is a hosted platform, not a self-managed Jenkins instance.
Self-managed Jenkins was the default for a decade and created a generation of engineers whose relationship with CI is primarily adversarial. Plugin conflicts, Java heap errors, agent configuration drift, security patches that nobody has time to apply — maintaining Jenkins at any scale is a part-time job. For most teams, that job isn't worth doing.
GitHub Actions is the default recommendation if your code is on GitHub. Native integration with your repository, a massive marketplace of reusable actions, generous free tier for public repositories, and a YAML-based workflow format that's learnable in a day. The ecosystem maturity means that for almost any task — container builds, cloud deployments, security scanning, release automation — a well-maintained action exists.
GitLab CI is the natural choice for GitLab repositories. Tightly integrated with the platform, excellent pipeline visualisation, and a strong option for enterprises needing self-hosted infrastructure without the Jenkins overhead.
Buildkite occupies a specific niche: agents run on your own infrastructure, the control plane is hosted. This gives you custom hardware (important for teams doing GPU workloads or needing specific compliance guarantees), the speed of local agents, and the operational simplicity of not managing the orchestration layer. Expensive but worth it at scale.
Dagger is worth knowing about for teams frustrated by the "works locally, fails in CI" problem. Pipelines defined in code — TypeScript, Go, Python — that run identically in your terminal and in CI. The local debugging story is genuinely better.
Whatever you choose: define your pipeline as code, committed in the repository alongside the code it builds. A pipeline configured through a web UI is undocumented, unversioned, and impossible to reproduce when something goes wrong.
The five stages of a modern CI pipeline
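A well-structured pipeline runs in stages, with each stage starting only once the stages it depends on have passed. The stages are mostly sequential; the one exception is that Stage 3 (security) runs alongside Stage 2 (test) because the two are independent. Earlier stages are faster and cheaper; later stages are slower and more thorough. Failing fast on cheap checks avoids wasting time running expensive ones.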
Stage 1: Validate (under 2 minutes)
The first stage catches the obvious problems before anything heavier runs. This is what gives engineers fast feedback on the most common classes of mistake.
Linting and formatting checks run the same configuration as the pre-commit hooks. If formatting is wrong, fail loudly — don't auto-fix in CI. The developer should fix locally and push again. Auto-fixing in CI creates commits that weren't authored by the developer and creates confusion about what was actually reviewed.
Type checking runs as a separate job so type errors appear immediately without waiting for tests. tsc --noEmit, mypy, pyright — fast, high signal, often the first thing that tells an engineer their change broke something.
Secrets scanning on every push. Gitleaks or TruffleHog scanning the diff for credential patterns. If a secret is detected, the pipeline stops immediately. This is the backstop for the pre-commit hook: it catches credentials that were committed with --no-verify or on machines where the hook wasn't installed.
Commit message validation if you're using conventional commits — and after reading Post 3, you should be. Validates the format of PR titles or commit messages to ensure automated changelog generation will work downstream.
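As a rough illustration of Stage 1, the workflow below runs linting, type checking, and secrets scanning as three independent jobs, so they execute in parallel and each reports on its own. It is a sketch under assumptions that are not from this post: a Node.js/TypeScript project with ESLint and Prettier as dev dependencies, Node 20, and the community gitleaks/gitleaks-action for the secrets scan.

```yaml
# .github/workflows/ci.yml (validate stage only; illustrative, not a drop-in file)
name: ci
on: [push, pull_request]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm                    # reuse npm's download cache, keyed on the lockfile
      - run: npm ci
      - run: npx eslint .
      - run: npx prettier --check .     # reports unformatted files; never rewrites them

  typecheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci
      - run: npx tsc --noEmit

  secrets:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0                # full history, so every pushed commit can be scanned
      - uses: gitleaks/gitleaks-action@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

Keeping the three checks in separate jobs costs a little duplicated setup, but a type error shows up as its own red check instead of being hidden behind a lint failure.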
Stage 2: Test (2–8 minutes)
Tests are the most time-expensive stage. The goal is maximum signal in minimum time, which requires parallelisation, caching, and intelligent test splitting.
Parallel test execution is the highest-leverage optimisation. A ten-minute test suite split across five runners approaches a two-minute test suite, plus whatever each runner spends on checkout and setup. Most CI platforms support matrix builds or test splitting natively. Jest's and Vitest's --shard flags split a suite across runners with minimal configuration; pytest-xdist spreads tests across worker processes within a single runner.
Dependency caching recovers two to five minutes on every run by not reinstalling packages that haven't changed. Cache node_modules keyed on package-lock.json hash, pip wheels keyed on requirements.txt hash, Go module cache keyed on go.sum. The cache invalidates when dependencies change and is reused when they don't. A cache hit should take under thirty seconds.
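The sharding and caching described above might look like the following in GitHub Actions. This is a minimal sketch, assuming a Node.js project tested with Jest 28 or later (the first release with --shard); the shard count of five and the cache key prefix are arbitrary choices for illustration.

```yaml
# Test job sketch: five shards, node_modules cached on the lockfile hash.
name: test
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false                  # let every shard report, even if one fails
      matrix:
        shard: [1, 2, 3, 4, 5]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - id: deps
        uses: actions/cache@v4
        with:
          path: node_modules
          # The key changes whenever package-lock.json changes, which invalidates the cache.
          key: ${{ runner.os }}-node20-modules-${{ hashFiles('package-lock.json') }}
      - if: steps.deps.outputs.cache-hit != 'true'
        run: npm ci                     # install only on a cache miss
      - run: npx jest --shard=${{ matrix.shard }}/5
```

Each matrix entry becomes its own runner, so the wall-clock time is roughly that of the slowest shard plus setup.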
Coverage gates on changed lines, not total coverage. Total coverage gates create perverse incentives — teams write tests to hit the number rather than to cover meaningful behaviour. Diff coverage gates — requiring that new and changed lines meet a coverage threshold — prevent new code shipping untested without demanding retroactive coverage of legacy code that was never tested.
Test result reporting in JUnit XML format, published to the CI platform. This surfaces failing test names directly in the PR interface. Engineers don't hunt through log files to find what broke — the failed test appears as a named annotation on the PR.
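A diff coverage gate and JUnit output can be sketched as below. The example assumes a Python project using pytest with pytest-cov, the open-source diff-cover tool, a base branch named main, and an 80% threshold on changed lines; none of those specifics come from this post.

```yaml
# Diff coverage sketch: gate on changed lines only, and keep the JUnit report.
name: coverage
on: pull_request

jobs:
  coverage:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0                # diff-cover needs history to diff against origin/main
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install pytest pytest-cov diff-cover
      - run: pytest --cov --cov-report=xml --junitxml=reports/junit.xml
      # Fails the job if lines changed relative to origin/main are under 80% covered.
      - run: diff-cover coverage.xml --compare-branch=origin/main --fail-under=80
      - if: always()                    # keep the report even when tests fail
        uses: actions/upload-artifact@v4
        with:
          name: junit-report
          path: reports/junit.xml
```

Uploading the file only preserves it. Turning it into named annotations on the PR takes a separate reporting step, which is left out here.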
Flaky test detection tracked over time, using a service such as BuildPulse or Trunk Flaky Tests. GitHub Actions does not track flakiness on its own, so the history has to come from a tool like these or from test results you store yourself. Any test that fails more than once in ten runs without a code change is flaky. Quarantine it automatically. A flaky test in your blocking suite is worse than no test: it trains engineers to ignore red builds.
Stage 3: Security (parallel with Stage 2)
Security checks run in parallel with tests to avoid adding to total pipeline time. Running security and tests sequentially when they're independent is one of the most expensive mistakes in pipeline design.
SAST (Static Application Security Testing) scans source code for security vulnerabilities. Semgrep with a security ruleset is fast and configurable. CodeQL provides deeper analysis for GitHub repositories. Block only on high and critical findings — blocking on informational findings creates alert fatigue that trains engineers to dismiss the scanner.
Dependency CVE scanning flags known vulnerabilities in your dependencies. Snyk, OWASP Dependency-Check, npm's built-in npm audit, or pip-audit for Python. Block on critical CVEs with available fixes. Report medium and low without blocking.
IaC scanning on every Terraform or Pulumi change. Checkov or tfsec catch Terraform misconfigurations before they're applied: public S3 buckets, overly permissive IAM roles, unencrypted storage volumes. Pulumi programs need a Pulumi-aware policy tool instead, such as Pulumi's own CrossGuard. Discovering a misconfiguration in a PR review is free. Discovering it after a security audit is not.
Licence compliance for codebases where it matters. FOSSA or license-checker flags GPL licences in proprietary code, missing attribution requirements, and licence incompatibilities. The legal risk of shipping a licence violation outweighs the inconvenience of checking.
SBOM generation: a Software Bill of Materials listing every dependency and version, produced by Syft or one of the CycloneDX generators. This is increasingly requested by enterprise customers and in compliance work (SOC 2, FedRAMP). Generate it in CI rather than as a manual process.
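To make the "block only on what matters" policy concrete, here is a sketch of two of the Stage 3 checks. It assumes Semgrep installed from PyPI with the public p/security-audit ruleset, and an npm project with a committed lockfile; the job names are illustrative.

```yaml
# Security stage sketch: SAST and dependency audit as independent jobs.
name: security
on: [push, pull_request]

jobs:
  sast:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install semgrep
      # --severity ERROR keeps only Semgrep's highest severity level;
      # --error makes the step exit non-zero when any such finding remains.
      - run: semgrep scan --config p/security-audit --severity ERROR --error

  dependency-audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      # Exits non-zero only for critical advisories; lower severities are still printed.
      - run: npm audit --audit-level=critical
```

Note that npm audit has no switch for "critical with a fix available", so that part of the policy would need a dedicated scanner or some post-processing of the audit output.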
Stage 4: Build (only after stages 1–3 pass)
The build stage produces the deployable artefact. It only runs after all prior quality gates have passed — there's no point building something that failed its tests.
Semantic versioning from commits using semantic-release or release-please. The version number is derived from conventional commit messages automatically — feat: commits bump minor, fix: commits bump patch, feat!: or BREAKING CHANGE: commits bump major. No manual version bumps, no arguments about numbers, no forgetting to update a version file.
Multi-stage Docker builds keep images lean. One stage installs build dependencies and compiles. A separate minimal stage contains only the runtime artefact. A Node.js application that starts as a 1.2GB node:20 image becomes a 120MB distroless image. Smaller images pull faster, scan faster, and have a smaller attack surface.
Pin base images by digest. FROM node:20-alpine@sha256:abc123 instead of FROM node:20-alpine. Tags are mutable — the same tag can point to a different image tomorrow. Digests are immutable — they pin to a specific image content hash and make builds reproducible.
BuildKit layer caching in CI via the --cache-from / --cache-to flags pointing to a registry backend. A cold Docker build that takes eight minutes becomes a warm build that takes under a minute when only application code changed and the expensive dependency-installation layer is served from cache.
Multi-platform builds for both linux/amd64 and linux/arm64 using docker buildx. ARM instances (AWS Graviton) are meaningfully cheaper than their x86 equivalents for many workloads. Building for both architectures in CI means you can deploy to either without rebuilding.
Container image scanning with Trivy or Grype before pushing to the registry. Catches OS-level CVEs introduced by the base image — vulnerabilities that SAST wouldn't see because they're in the runtime environment, not the application code.
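The registry-backed cache and multi-platform build from this stage could be wired up roughly as follows. This is a sketch, assuming images live in GitHub Container Registry under a hypothetical ghcr.io/example-org/myapp name with a Dockerfile at the repository root; the image scan described above is omitted to keep the example short.

```yaml
# Build stage sketch: buildx, registry cache, two architectures.
name: build
on: [push]

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write                        # needed to write the cache (and image) to ghcr.io
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-qemu-action@v3    # emulation for the non-native architecture
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: .
          platforms: linux/amd64,linux/arm64
          # Build on every push, but only push the image itself from main.
          push: ${{ github.ref == 'refs/heads/main' }}
          tags: ghcr.io/example-org/myapp:${{ github.sha }}
          cache-from: type=registry,ref=ghcr.io/example-org/myapp:buildcache
          cache-to: type=registry,ref=ghcr.io/example-org/myapp:buildcache,mode=max
```

The mode=max setting exports layers from every build stage, not only the final one, which is what lets a multi-stage build reuse its dependency-installation layer.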
Stage 5: Publish (main branch only)
Publishing only runs on merge to main, after every prior stage has passed. This is the handoff point between CI and CD.
Push to the container registry with a version tag, a branch tag, and a git SHA tag. The version tag (myapp:1.4.2) is the stable reference. The git SHA tag (myapp:a3f9d12) provides complete traceability — from the running container back to the exact commit that produced it.
Artefact signing with Cosign, part of the Sigstore project, provides cryptographic proof that the image was produced by your CI pipeline and hasn't been tampered with in the registry. Keyless signing using OIDC means there are no long-lived signing keys to manage. Consumers can verify the signature before deployment.
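A publish job with tagging and keyless signing might look like this. It is a sketch only: the ghcr.io/example-org/myapp image name is hypothetical, and the 1.4.2 version tag is hard-coded here for illustration where a real pipeline would take it from the release tooling.

```yaml
# Publish stage sketch: push tagged image, then sign it by digest.
name: publish
on:
  push:
    branches: [main]

jobs:
  publish:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
      id-token: write                   # lets the job request the OIDC token used for keyless signing
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - id: build
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: |
            ghcr.io/example-org/myapp:1.4.2
            ghcr.io/example-org/myapp:${{ github.sha }}
      - uses: sigstore/cosign-installer@v3
      # Sign the immutable digest, not a tag, so the signature cannot drift.
      - run: cosign sign --yes ghcr.io/example-org/myapp@${{ steps.build.outputs.digest }}
```

On the consuming side, cosign verify would be run with the expected certificate identity and OIDC issuer before the image is deployed.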
GitOps manifest update — open a PR or push a commit to the deployment repository updating the image tag in the Kubernetes manifests. This triggers ArgoCD or Flux to reconcile the cluster. The CI system doesn't need cluster credentials. The deployment system watches the repository.
Release notes auto-generated from conventional commits and published as a GitHub or GitLab release, with the SBOM attached as a release artefact. Slack or Teams notification with: what changed, who merged it, the new version, a link to the release notes, and a link to the deployment dashboard.
Pipeline as code: the structure that scales
Your pipeline definition lives in the repository alongside the code it builds. This is non-negotiable — a pipeline configured through a web UI is undocumented and unreproducible.
For GitHub Actions, a structure that works across most projects:
.github/workflows/
    ci.yml          ← runs on every push and PR
    release.yml     ← runs on merge to main
    scheduled.yml   ← nightly security scans, dependency updates
    pr-checks.yml   ← PR title lint, size check, label enforcement
The ci.yml job structure:
jobs:
  validate:   # lint, typecheck, secrets scan (run in parallel)
  test:       # unit + integration, sharded across runners
  security:   # SAST, CVE scan, IaC scan (parallel with test)
  build:      # only if validate + test + security pass
  publish:    # only on main branch, only if build passes
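Expanded into valid workflow syntax, that outline could look like the skeleton below. The echo steps are placeholders standing in for the real work, and validate is collapsed into a single job for brevity; only the needs and if wiring is the point.

```yaml
# ci.yml skeleton: the dependency graph between stages.
name: ci
on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - run: echo "lint, typecheck, secrets scan"

  test:
    needs: validate
    runs-on: ubuntu-latest
    steps:
      - run: echo "unit and integration tests"

  security:
    needs: validate                     # same dependency as test, so the two run side by side
    runs-on: ubuntu-latest
    steps:
      - run: echo "SAST, CVE scan, IaC scan"

  build:
    needs: [test, security]             # starts only when both have succeeded
    runs-on: ubuntu-latest
    steps:
      - run: echo "build the artefact"

  publish:
    needs: build
    # Skipped on pull requests and on pushes to any branch other than main.
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - run: echo "push, sign, update manifests"
```

A job whose needs fail is skipped by default, which is what gives the fail-fast behaviour without any extra conditions.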
Reusable workflows are worth investing in once you have more than one repository. GitHub Actions' workflow_call trigger lets you define a workflow in a central repository (the organisation's .github repository is a common home) and call it from any other repository that has access to it. Update the central workflow once and the change reaches every caller that references a branch or moving tag; callers pinned to a commit SHA pick it up only when they bump the pin. This is how platform teams maintain CI standards across a fleet of services without requiring each team to maintain their own pipeline.
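A minimal sketch of the two halves of a reusable workflow follows, shown as two YAML documents in one listing. The organisation name example-org, the file name node-ci.yml, and the node-version input are all hypothetical.

```yaml
# File 1, in the central repository example-org/.github:
# .github/workflows/node-ci.yml
name: node-ci
on:
  workflow_call:
    inputs:
      node-version:
        type: string
        default: "20"

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4       # checks out the calling repository, not the central one
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ inputs.node-version }}
          cache: npm
      - run: npm ci
      - run: npm test
---
# File 2, in each service repository:
# .github/workflows/ci.yml
name: ci
on: [push, pull_request]

jobs:
  ci:
    # owner/repo/path@ref; the doubled ".github" is the repository name followed by the directory.
    uses: example-org/.github/.github/workflows/node-ci.yml@main
    with:
      node-version: "20"
```

Referencing @main means every caller tracks the central workflow as it changes; swapping in a tag or commit SHA trades that convenience for stability.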
The performance targets worth defending
Under 2 minutes for the validate stage on every push. This is the feedback loop that tells engineers immediately whether their commit is sane. If it takes longer, the pre-commit hooks need work or the validate stage is doing too much.
Under 10 minutes for the full CI run on a PR. This is the feedback loop that determines whether engineers wait for CI or start something else. Above ten minutes, context switching begins. Above twenty minutes, engineers routinely keep several branches in flight at once, which creates the kind of chaos trunk-based development is designed to prevent.
Zero tolerance for flaky tests in the blocking suite. One flaky test that gets triggered three times per day means engineers see a red build three times per day for no reason. After a week, they've stopped reading red builds carefully. After a month, you've lost CI.
100% of artefacts versioned and signed. Any artefact that isn't versioned is an artefact you can't roll back to reliably. Any artefact that isn't signed is an artefact whose provenance you can't verify.
What goes wrong when CI is broken
The forty-five minute pipeline. Engineers stop waiting for it. PRs merge without a green build. The pipeline becomes a formality that slows things down without catching anything, because nobody waits for the results.
The chronic flaky suite. Red builds become noise. Engineers merge on red because "it's probably just the flaky test." Eventually a real failure gets merged because it looked like flakiness. The first time that causes a production incident, the team has to rebuild trust in CI from scratch.
The inconsistent environment. CI passes but the code fails locally, or vice versa. Usually a dependency version mismatch, an environment variable that exists in CI but not locally, or a service that's mocked differently. The fix is environmental parity — lock your dependencies, use dev containers, make local and CI as identical as possible.
Security as theatre. SAST and CVE scanning configured to never fail the build. Findings appear in reports that nobody reads. The security gates exist on the dashboard but produce no change in behaviour. This is worse than not having the gates — it creates the impression of security practice without the substance.
The click-configured pipeline. Someone built the pipeline through the web UI three years ago. Nobody knows exactly how it's configured. When it breaks, diagnosing the problem requires access to the CI platform's admin UI. When the platform changes a feature, the pipeline silently stops working. Pipeline as code prevents all of this.
If you do one thing from this post
Measure your current pipeline's end-to-end time. Not the job durations in the CI system's logs, but the wall-clock time, queueing included, from when a PR is opened to when a green or red result appears in the PR interface.
If it's over ten minutes, you have a feedback loop problem. Pick the single slowest job in the pipeline and investigate why it's slow. Usually it's one of three things: tests that could be parallelised, dependencies that could be cached, or a job that's doing sequential work that could be parallel. Fix that one job. Then measure again.
The discipline of treating pipeline performance as a product quality metric — something you track, something you improve, something that has a target — is what separates pipelines that get faster over time from pipelines that get slower.
Next up: Post 8 — Shift Left: How to Make Security Every Engineer's Job
Previous: Post 6: The Testing Trophy: Why You're Probably Writing the Wrong Tests