The Testing Trophy: Why You're Probably Writing the Wrong Tests

Series: The Modern SDLC · Post 6 of 17
Most engineering teams have a testing problem. Not the problem you might expect — not too few tests, though that's common. The problem is tests that don't catch the bugs that actually reach production.
The pattern looks like this: a team has 80% code coverage, a green CI pipeline, and a reasonable suite of unit tests. Then a production incident happens. Someone looks at the failing code. There are tests for it. The tests pass. The bug exists because the mock used in the unit tests behaved differently from the real dependency it was mocking. The tests were testing the mock, not the behaviour.
This is the most expensive kind of testing investment: the kind that creates confidence without providing the protection that would justify it.
The shift from the classic testing pyramid to the testing trophy changes this. It's not a radical rethink of testing — it's a recalibration of where to put the effort, based on where bugs actually originate and where tests actually catch them.
The one thing to remember
The value of a test is proportional to how closely it resembles the conditions under which the code runs in production. A test that uses a different database, a different network, and different data from production is testing a different system.
The testing pyramid vs the testing trophy
The classic testing pyramid says: write lots of unit tests, some integration tests, and a few end-to-end tests. The shape reflects cost — unit tests are cheap, end-to-end tests are expensive — and the advice was sound when it was written.
The problem is what happened in practice. Teams took "lots of unit tests" seriously and built large suites of isolated unit tests with heavy mocking. They tested each function in isolation, mocking every dependency. The tests were fast, they were deterministic, and they provided almost no signal about whether the system actually worked.
The testing trophy, popularised by Kent C. Dodds, rebalances the investment:
- Static analysis and type checking at the base — instant feedback, catches a class of bugs before any test runs
- Unit tests — the next layer, for pure logic and algorithms
- Integration tests — the largest layer, the bulk of the testing value
- End-to-end tests — at the top, covering critical paths only
The key shift is that integration tests — tests that exercise multiple real components together — become the primary investment, not unit tests. The reason is simple: most production bugs don't live in pure functions. They live at the boundaries — between your code and the database, between your code and the API, between your service and another service. Unit tests with mocked boundaries can't catch those bugs. Integration tests with real boundaries can.
Static analysis: the tests you get for free
Before writing a single test, you have access to a testing layer that's instant, comprehensive, and costs nothing to run: static analysis and type checking.
TypeScript in strict mode, mypy or pyright for Python, the Go compiler, rustc: these tools catch entire categories of bugs before any code executes. Type mismatches, function calls with wrong argument types, some unreachable code, and (where the type system models nullability) many null dereferences are caught at the point of writing, not at the point of running.
The discipline of keeping types accurate is worth the investment. Types are documentation that the compiler verifies. A function signature processOrder(order: Order): Promise<Receipt> tells you what the function accepts, what it returns, and that it's asynchronous — without reading the implementation. A function signature processOrder(order: any): any tells you nothing and catches nothing.
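The same contrast can be sketched in Python with type hints, assuming a checker such as mypy or pyright runs in strict mode. The Order and Receipt types and the process_order function are hypothetical, invented here for illustration.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    order_id: str
    total_cents: int


@dataclass(frozen=True)
class Receipt:
    order_id: str
    charged_cents: int


async def process_order(order: Order) -> Receipt:
    # The signature alone says: takes an Order, returns a Receipt, is async.
    return Receipt(order_id=order.order_id, charged_cents=order.total_cents)


receipt = asyncio.run(process_order(Order(order_id="A-1", total_cents=1250)))

# A strict type checker flags the call below before the code ever runs,
# because a dict is not an Order:
# asyncio.run(process_order({"order_id": "A-1"}))
```

Annotating the parameter as Any instead would silence the checker on that last call, which is the "tells you nothing and catches nothing" case.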
Linters operating as security tools add another layer: ESLint security plugins catching common injection patterns, Semgrep rules detecting known-bad code patterns, Bandit scanning Python for security antipatterns. These run in seconds and catch issues that code review misses because reviewers aren't systematically checking every line against a catalogue of vulnerability patterns.
The practical advice: configure your type checker in strict mode from day one. Retrofitting strict typing onto a codebase that was written without it is a months-long project. Building with it from the start is a minor constraint that pays back continuously.
Unit tests: for pure logic, not for everything
Unit tests are fast, deterministic, and excellent at a specific job: testing pure logic in isolation. They're the right tool for business rules, algorithms, data transformations, state machines, edge cases, and error handling.
They're the wrong tool for testing code that interacts with external systems — databases, HTTP endpoints, file systems, queues. When you mock those interactions in unit tests, you're not testing your code against the real system. You're testing your code against your assumptions about the real system. Those assumptions are where the bugs live.
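A small sketch of how a mock and a real engine can disagree. To stay self-contained it uses an in-memory SQLite database from the standard library as a stand-in for the real dependency; the users table and the register_user function are hypothetical.

```python
import sqlite3
from unittest.mock import Mock


def register_user(conn, email: str) -> None:
    # Code under test: inserts a user row through a DB-API style connection.
    conn.execute("INSERT INTO users (email) VALUES (?)", (email,))


# Mocked connection: the mock accepts anything, so a duplicate email "works".
mock_conn = Mock()
register_user(mock_conn, "a@example.com")
register_user(mock_conn, "a@example.com")
mock_accepted_duplicate = mock_conn.execute.call_count == 2

# Real engine: the UNIQUE constraint rejects the second insert.
real_conn = sqlite3.connect(":memory:")
real_conn.execute("CREATE TABLE users (email TEXT NOT NULL UNIQUE)")
register_user(real_conn, "a@example.com")
try:
    register_user(real_conn, "a@example.com")
    real_rejected_duplicate = False
except sqlite3.IntegrityError:
    real_rejected_duplicate = True
```

The mocked run encodes the assumption that inserts always succeed. The real engine enforces the schema, which is where the production behaviour actually comes from.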
The AAA pattern — Arrange, Act, Assert — should shape every unit test. Set up the inputs, call the code, verify the output. Tests that don't follow this shape are usually doing too much. A test with three Act sections is three tests. A test with no clear Arrange section is probably relying on shared state that makes the test order-dependent and fragile.
Test naming is specification. should_return_error_when_email_is_invalid tells you exactly what behaviour is being verified, what input triggers it, and what the expected outcome is. When this test fails, you know immediately what broke without reading the test body. test_email_validation tells you none of those things.
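Here is a rough sketch of both ideas together: the Arrange, Act, Assert shape and a name that reads as a specification. The validate_email function and its deliberately simple pattern are hypothetical, and the test_ prefix is only there because pytest's default discovery expects it.

```python
import re
from typing import Optional

# Deliberately simple pattern for illustration, not a full RFC 5322 validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_email(email: str) -> Optional[str]:
    """Return an error message for an invalid email, or None if it is valid."""
    if not EMAIL_PATTERN.match(email):
        return "invalid email"
    return None


def test_should_return_error_when_email_is_invalid() -> None:
    # Arrange
    email = "not-an-email"
    # Act
    error = validate_email(email)
    # Assert
    assert error == "invalid email"


def test_should_return_no_error_when_email_is_valid() -> None:
    # Arrange
    email = "dev@example.com"
    # Act
    error = validate_email(email)
    # Assert
    assert error is None
```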
Mocking discipline. Mock at the boundary — at the database, the HTTP client, the file system. Don't mock your own code. A unit test that requires eight mocks to set up is a signal that the unit being tested has too many dependencies or is doing too much. The complexity of the setup reveals complexity in the design.
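As a sketch of that discipline: the only mock below sits at the HTTP boundary, and the pricing logic runs for real. The function names, the /prices/ path, and the payload fields are all assumptions made up for the example.

```python
from typing import Callable, Dict
from unittest.mock import Mock


def price_with_tax(net_cents: int, tax_bps: int) -> int:
    """Own code: pure logic, exercised for real, never mocked."""
    return net_cents + net_cents * tax_bps // 10_000


def quote_for_sku(http_get: Callable[[str], Dict[str, int]], sku: str) -> int:
    """Fetch pricing data through the injected HTTP boundary, then apply own logic."""
    payload = http_get(f"/prices/{sku}")
    return price_with_tax(payload["net_cents"], payload["tax_bps"])


def test_should_add_tax_to_fetched_net_price() -> None:
    # Arrange: a single mock, placed at the HTTP boundary only.
    http_get = Mock(return_value={"net_cents": 1000, "tax_bps": 2000})
    # Act
    total = quote_for_sku(http_get, "sku-1")
    # Assert
    assert total == 1200
    http_get.assert_called_once_with("/prices/sku-1")
```

If this test needed a second mock for price_with_tax, it would be checking the wiring between two of your own functions rather than the behaviour.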
Property-based testing is underused and highly valuable for the right problems. Instead of writing individual test cases with specific inputs, property-based tests define properties that should hold for all inputs and use a framework (fast-check for JavaScript, Hypothesis for Python, QuickCheck for Haskell) to generate hundreds of random inputs automatically. "This function should always return a positive number" or "parsing and serialising should be round-trip stable" — these properties, tested against hundreds of generated inputs, catch edge cases no human would think to write a test for.
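To show the shape of a round-trip property without pulling in a dependency, here is a hand-rolled sketch that uses only the standard library. It is not Hypothesis: a real framework adds smarter generators and shrinks failing inputs to a minimal case. The random_record generator is a hypothetical helper.

```python
import json
import random
import string


def random_record(rng: random.Random) -> dict:
    """Generate a small random dict containing only JSON-compatible values."""
    record = {}
    for _ in range(rng.randint(0, 5)):
        key = "".join(rng.choices(string.ascii_letters, k=rng.randint(1, 8)))
        record[key] = rng.choice([
            rng.randint(-10**6, 10**6),
            rng.choice([True, False]),
            None,
            "".join(rng.choices(string.printable, k=rng.randint(0, 12))),
        ])
    return record


def check_round_trip(runs: int = 300, seed: int = 0) -> int:
    """Property: serialising then parsing returns the original record."""
    rng = random.Random(seed)
    for _ in range(runs):
        record = random_record(rng)
        assert json.loads(json.dumps(record)) == record
    return runs


checked = check_round_trip()
```

The fixed seed keeps the run reproducible, which matters once a generated input actually fails.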
Integration tests: where most of the value lives
Integration tests exercise multiple real components together. Your code against a real database. Your HTTP handler against a real server. Your queue consumer against a real broker. This is where the bugs that reach production originate, and this is where the testing effort should be concentrated.
The biggest objection to integration tests has historically been speed and complexity — spinning up a real database for tests is slow and requires infrastructure. Testcontainers eliminates this objection. It provides libraries for Java, Go, Node.js, Python, .NET, and others that spin up real Docker containers — a real Postgres, a real Redis, a real Kafka — in your test suite, run the tests, and tear them down afterward. The container starts in a few seconds. The tests run against the real database engine. The gap between what the tests verify and what production runs is closed.
This is the single highest-ROI testing investment most teams aren't making. Teams that adopt Testcontainers consistently report that an entire class of production bugs — ones that only appeared because the mock behaved differently from the real database — stop happening.
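A minimal sketch of what this looks like with the Python Testcontainers library, assuming Docker is running and the testcontainers, sqlalchemy, and a Postgres driver package are installed. The function name and the postgres:16 image tag are assumptions; pin the tag to whatever production runs.

```python
def postgres_version_via_testcontainers() -> str:
    """Start a throwaway Postgres container and ask it for its version."""
    # Imported lazily so this module still imports on machines without Docker
    # or without the third-party packages installed.
    import sqlalchemy
    from testcontainers.postgres import PostgresContainer

    # The container is started on entry and removed on exit.
    with PostgresContainer("postgres:16") as postgres:
        engine = sqlalchemy.create_engine(postgres.get_connection_url())
        with engine.connect() as connection:
            row = connection.execute(sqlalchemy.text("SELECT version()")).fetchone()
            return str(row[0])
```

In a real suite the container would usually live in a session-scoped fixture so the startup cost is paid once, not per test.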
What to cover with integration tests:
Your data access layer against a real database — including the actual SQL queries, the actual schema, and real data. ORM queries that look correct can produce wrong results with real data distribution. Query plans that perform well in isolation can degrade with realistic data volume.
Your HTTP handlers with a real server — the full request/response cycle including headers, middleware, authentication, and serialisation. A handler test that mocks the request object is testing a different path than the one your users hit.
Your message consumers with a real broker — including retry logic, dead letter queues, and ordering guarantees. Queue behaviour is notoriously hard to mock accurately.
Run real migrations in tests. Your integration tests should run against a database schema produced by your actual migration scripts, not a test-specific schema that you maintain separately. This catches migration bugs before they reach production and ensures your tests reflect the real schema rather than a version of it.
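The idea can be sketched as follows, using in-memory SQLite from the standard library to stay self-contained. The two inline statements are hypothetical stand-ins for your real migration scripts, which would normally be read from files and applied by your migration tool against the same engine production uses.

```python
import sqlite3
from typing import List

# Hypothetical migrations, in the order the real migration tool would apply them.
MIGRATIONS = [
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)",
    "ALTER TABLE users ADD COLUMN display_name TEXT NOT NULL DEFAULT ''",
]


def migrated_connection() -> sqlite3.Connection:
    """Build the test schema by replaying every migration, never by hand."""
    conn = sqlite3.connect(":memory:")
    for statement in MIGRATIONS:
        conn.execute(statement)
    return conn


def column_names(conn: sqlite3.Connection, table: str) -> List[str]:
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


conn = migrated_connection()
conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
columns = column_names(conn, "users")
```

A migration that fails to apply, or that leaves out a column the code expects, now breaks the test run instead of a deploy.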
Test data management needs a clear strategy. Each test should create its own data and clean up after itself. Tests that share mutable state produce order-dependent failures that are the hardest category of test failure to diagnose. Factories or fixtures that create valid, realistic test data are worth building early — they pay back in every test you write after.
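A factory can be as small as the sketch below. The User type and make_user helper are hypothetical; the point is that each test asks for valid data and overrides only the field it cares about.

```python
import itertools
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class User:
    user_id: int
    email: str
    is_active: bool = True


# Monotonic counter so every generated user gets a unique id and email.
_next_id = itertools.count(1)


def make_user(**overrides) -> User:
    """Build a valid User with unique defaults; override only what the test needs."""
    user_id = next(_next_id)
    defaults = User(user_id=user_id, email=f"user{user_id}@example.com")
    return replace(defaults, **overrides)


first = make_user()
inactive = make_user(is_active=False)
```

Unique defaults mean two tests never collide on an email constraint, which removes one common source of order-dependent failures.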
Contract tests: verifying service boundaries without end-to-end tests
In a distributed system, end-to-end tests across services are slow, brittle, and expensive. But something has to verify that the contract between a service and its consumers is honoured. Contract tests fill this gap.
The consumer-driven contract model (implemented by the Pact framework) works like this: the consumer defines what it expects from the provider — the request format, the response shape, the status codes. That expectation is captured as a contract. The provider's CI pipeline verifies that it can satisfy all registered consumer contracts independently, without the consumer being deployed or running.
The result: a break in the API contract is caught before deployment, in the provider's CI pipeline, not during an end-to-end test run or — worse — in production. The consumer and provider can be tested independently, and the contract is the specification that connects them.
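To make the mechanics concrete, here is a hand-rolled sketch of the consumer-driven idea. It does not use Pact's API: the contract format, the provider_handler function, and verify_contract are all invented for illustration.

```python
from typing import Callable, List, Tuple

Handler = Callable[[str, str], Tuple[int, dict]]

# What the consumer says it relies on: one request and the response shape it reads.
CONSUMER_CONTRACT = {
    "request": {"method": "GET", "path": "/orders/42"},
    "response": {"status": 200, "body_fields": {"id": int, "status": str}},
}


def provider_handler(method: str, path: str) -> Tuple[int, dict]:
    """Hypothetical provider handler standing in for the real service."""
    if method == "GET" and path.startswith("/orders/"):
        order_id = int(path.rsplit("/", 1)[1])
        return 200, {"id": order_id, "status": "shipped", "carrier": "DHL"}
    return 404, {}


def verify_contract(contract: dict, handler: Handler) -> List[str]:
    """Replay the consumer's request against the provider and list any violations."""
    request, expected = contract["request"], contract["response"]
    status, body = handler(request["method"], request["path"])
    problems = []
    if status != expected["status"]:
        problems.append(f"expected status {expected['status']}, got {status}")
    for field, field_type in expected["body_fields"].items():
        if field not in body:
            problems.append(f"missing field {field}")
        elif not isinstance(body[field], field_type):
            problems.append(f"wrong type for {field}")
    return problems


problems = verify_contract(CONSUMER_CONTRACT, provider_handler)
```

Extra fields such as carrier are fine: the provider may add to its response freely, but it cannot remove or retype what a consumer has registered.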
When contract testing is most valuable: multiple teams consuming one API, mobile clients where you can't force upgrades, external partners who depend on your API stability, event-driven systems where schema drift is invisible until something breaks.
OpenAPI contract testing is the simpler alternative when Pact is more than you need: validate every API response against its OpenAPI specification in integration tests. If the response shape drifts from the spec, the test fails. This catches accidental breaking changes early and keeps the spec as a reliable reference for consumers.
End-to-end tests: cover critical paths, then stop
End-to-end tests are the most expensive tests to write, run, and maintain. Every e2e test you add is a maintenance liability — it exercises real infrastructure, it's sensitive to timing, it fails intermittently, and when it fails the failure message is usually less informative than the failure message from a unit or integration test.
Use them surgically. The right question is not "what should I cover with e2e tests?" but "what absolutely must be verified by running the full system together that can't be verified any other way?" The answer is usually the five to ten most critical user journeys: sign up, log in, complete the core action of the product, process a payment, complete onboarding.
Playwright is the modern default for browser-based e2e testing: it is fast, reliable, supports all major browsers, and has a strong API for writing tests that resist the brittleness that killed Selenium test suites. Cypress remains popular for teams that prefer its developer experience. For API-level e2e tests, Supertest, Hurl, or k6 smoke tests are reasonable options.
Flakiness is a critical-severity issue, not a nuisance. The moment a test starts failing intermittently, quarantine it (move it outside the blocking suite) and either fix it within a sprint or delete it. A test suite with 5% flakiness trains engineers to ignore red CI builds. Once that habit forms, you've lost the value of CI entirely.
Accessibility testing belongs in e2e. axe-core integrated into Playwright catches WCAG violations automatically on every page in your e2e suite. This is the cheapest and most reliable way to prevent accessibility regressions — it runs automatically on every PR without requiring anyone to remember to check.
Visual regression testing — Percy, Chromatic, or Playwright's screenshot comparison — catches unintended UI changes automatically. Useful for component libraries and design systems where visual regressions are a real risk. Run on PR, not on every commit.
Performance and security testing: in the pipeline, not before launch
Performance regressions are silent until they become production incidents. The database query that runs in 20ms against development data runs in 8 seconds against production data volume. The API endpoint that handles 10 requests per second fine starts timing out at 100. These problems are discoverable — if you look for them.
k6 is the modern tool for load testing in CI: write test scripts in JavaScript, define SLO thresholds (p95 latency must stay below 400ms), run in CI on every deploy to staging. A build that introduces a performance regression fails the gate automatically. This is qualitatively different from "we'll do a load test before we launch" — by the time you're preparing to launch, fixing a performance problem might require architectural changes.
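The gate logic itself is simple. The sketch below is not a k6 script (those are written in JavaScript); it only illustrates, in Python, what a p95 threshold check amounts to. The function names are hypothetical and the 400ms default mirrors the example threshold above.

```python
import statistics
from typing import List


def p95(latencies_ms: List[float]) -> float:
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(latencies_ms, n=100)[94]


def passes_latency_gate(latencies_ms: List[float], threshold_ms: float = 400.0) -> bool:
    """Return True when the 95th-percentile latency stays under the SLO threshold."""
    return p95(latencies_ms) < threshold_ms


# Synthetic samples: a healthy run, then the same run with a slow tail.
healthy = [100.0 + i for i in range(100)]
regressed = healthy[:90] + [2000.0] * 10

healthy_ok = passes_latency_gate(healthy)
regressed_ok = passes_latency_gate(regressed)
```

Note that the mean of the regressed sample is still under 400ms. Gating on a percentile rather than the average is what exposes the slow tail.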
SAST (Static Application Security Testing) in CI catches code-level security issues — injection vulnerabilities, unsafe deserialization, use of known-vulnerable functions. Semgrep with security rulesets, CodeQL for GitHub repositories, Bandit for Python. Run on every PR. Block only on high or critical findings — alert fatigue from low-severity findings trains engineers to ignore the scanner.
Dependency CVE scanning (Snyk, Dependabot, or npm audit / pip-audit) flags known vulnerabilities in your dependencies on every build. Block on critical vulnerabilities with available fixes. Report others without blocking.
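That blocking policy fits in a few lines. The Finding type below is a hypothetical, simplified shape; real scanners emit their own report formats, which you would map into something like this.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Finding:
    package: str
    severity: str  # one of "low", "medium", "high", "critical"
    fix_available: bool


def should_block(findings: List[Finding]) -> bool:
    """Block the build only on critical vulnerabilities that have a fix available."""
    return any(f.severity == "critical" and f.fix_available for f in findings)


report = [
    Finding("left-pad", "low", fix_available=True),
    Finding("old-parser", "critical", fix_available=False),
]
blocks_without_fix = should_block(report)
blocks_with_fix = should_block(report + [Finding("http-lib", "critical", fix_available=True)])
```

Everything that does not block still belongs in the report, so the unfixed critical finding stays visible without stalling every merge.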
Where each layer runs: the quality pipeline
Testing layers have natural homes in the development lifecycle. Running everything everywhere makes CI slow. Running nothing early makes feedback slow. The right distribution:
Pre-commit (seconds): Linting, formatting, type checking, secrets scanning. Fast enough that nobody bypasses it.
On every push / PR (under 10 minutes): Unit tests, fast integration tests with Testcontainers, SAST, dependency CVE scan, build verification. This is the gate that provides the fast feedback loop on every change.
On merge to main (10–20 minutes): Full integration test suite, contract tests, e2e smoke tests, visual regression.
On deploy to staging (30–60 minutes): Full e2e suite, load test baseline, DAST, performance regression check.
Scheduled / periodic: Chaos experiments, soak tests, penetration testing.
The 10-minute rule for the PR gate is worth defending aggressively. In practice, a CI run that takes 45 minutes isn't just 4.5 times slower than a 10-minute run; it runs maybe ten times less often, because engineers find ways around it. Parallelise, cache, and split aggressively to keep the blocking gate under ten minutes.
What goes wrong when testing strategy is wrong
The false confidence problem. A codebase with high unit test coverage and heavy mocking provides confidence that turns out not to be warranted. The tests pass, the production incident happens anyway, and the investigation reveals that the mock was wrong. This is worse than low coverage: low coverage at least looks risky, while a green, heavily mocked suite actively misrepresents the risk.
The slow suite nobody runs. A test suite that takes forty-five minutes to run gets run once before a release, not on every PR. Bugs that could have been caught early instead get caught late. The slow suite becomes a bottleneck on delivery frequency.
The flaky suite nobody trusts. Tests that fail intermittently train engineers to ignore red builds. The pipeline loses its value as a signal. The first time a real failure gets dismissed as "probably just flakiness," the damage is done.
The e2e-first strategy. Some teams rewrite their entire test suite as e2e tests because "they test the real thing." They do, and they'll also make your CI forty-five minutes long, fail intermittently for infrastructure reasons, and make you merge less frequently.
Testing implementation rather than behaviour. Unit tests that test private methods, verify internal state, or assert on implementation details rather than outputs. These tests break on every refactor, not because the behaviour changed, but because the internals changed. Tests that test behaviour survive refactors and provide accurate signal.
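A small sketch of the difference, using a hypothetical Cart class. The test asserts on what callers can observe, and the commented-out line shows the brittle alternative.

```python
class Cart:
    def __init__(self) -> None:
        # Internal detail: sku -> (unit price in cents, quantity).
        self._lines = {}

    def add(self, sku: str, unit_cents: int, quantity: int = 1) -> None:
        _, existing = self._lines.get(sku, (unit_cents, 0))
        self._lines[sku] = (unit_cents, existing + quantity)

    def total_cents(self) -> int:
        return sum(unit * qty for unit, qty in self._lines.values())


def test_total_reflects_added_items() -> None:
    # Arrange
    cart = Cart()
    # Act
    cart.add("tea", 450, quantity=2)
    cart.add("mug", 1200)
    # Assert on behaviour: the total a caller would see.
    assert cart.total_cents() == 2100
    # Brittle alternative, which breaks as soon as storage changes to a list:
    # assert cart._lines == {"tea": (450, 2), "mug": (1200, 1)}
```

Swapping the dict for a list of line items would leave the behavioural test green and the internal-state assertion red, even though nothing a user can observe has changed.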
If you do one thing from this post
Pick one test that currently mocks a database, a cache, or a queue, and rewrite it as an integration test using Testcontainers, against a real database, a real Redis, or a real broker.
Run both versions. Compare what they catch. The Testcontainers version will take a few seconds longer to run. It will also catch a class of bugs the mocked version can't.
That comparison is the most compelling argument for changing your testing strategy, and it takes an afternoon.
Next up: Post 7 — How to Build a CI Pipeline That Engineers Actually Trust



