Latency vs Throughput vs Bandwidth: Three Numbers, One System


Foundations Series

#  | Post | What it covers
00 | Intro | What the Foundations pillar covers and why it matters
01 | Availability | Uptime, the nines, and why 99% isn't good enough
02 | Reliability | Correctness over time: when uptime isn't enough
03 | Latency vs Throughput vs Bandwidth ← you are here | The three numbers that define system performance
04 | ACID vs BASE | Two philosophies for handling data under pressure
05 | CAP Theorem | The impossibility result every distributed system runs into
06 | PACELC Theorem | What CAP doesn't tell you about latency
07 | Consistency Models | The spectrum from "always correct" to "eventually correct"
08 | Single Point of Failure | Why one weak link breaks the whole chain
09 | High Availability vs Fault Tolerance | Similar goals, very different strategies
10 | Wrap-up | How all nine concepts connect

The problem

Your API is slow. You throw more servers at it — no improvement. You upgrade the database — still slow. You add a caching layer — better, but not solved. Three weeks and two engineer-months later, someone finally runs a network diagnostic and finds the bottleneck: every response is fine in isolation, but the pipe connecting your data centre to your CDN is saturated during peak hours.

You were fixing latency. The problem was bandwidth. They're not the same thing, and the fix for one rarely helps the other.

Performance debugging is one of the most expensive exercises in engineering precisely because teams routinely conflate these three numbers. You can't optimise what you haven't correctly diagnosed — and diagnosing correctly starts with knowing which metric you're actually measuring.


The core idea

Latency is how long a single request takes from start to finish. Throughput is how many requests a system can handle per unit of time. Bandwidth is the maximum capacity of the network pipe — how much data it can carry.

They're related. They're not the same. And optimising one without understanding the others can make the overall system worse.


The analogy: a motorway

A motorway makes all three concrete:

  • Bandwidth is the number of lanes. It's a physical property of the infrastructure — you can't send more cars than the road has capacity for.

  • Throughput is how many cars pass a given point per hour, measured in practice. A four-lane motorway can theoretically move 8,000 cars per hour; at rush hour with an accident in lane two, it might move 2,000.

  • Latency is how long it takes one specific car to drive from junction 1 to junction 10. This depends on speed limits, traffic density, roadworks, and whether that car has to stop at a service station.

The motorway can have high bandwidth (eight lanes) and terrible latency (roadworks every two miles). It can have low bandwidth (two lanes) and still run close to its full throughput (both lanes moving fast, no incidents). Understanding which constraint is active tells you which fix to apply: add lanes (bandwidth), clear the roadworks (latency), or improve traffic flow management (throughput).


How each one works

Latency

Latency is measured as the elapsed time between a request being sent and the response being received. It's expressed in milliseconds (ms) or microseconds (μs) for very fast systems.

The components of latency in a typical web request:

Total latency = Network transit time
              + Server processing time
              + Database query time
              + Serialisation/deserialisation time
              + Queue wait time (if any)

Every hop a request makes adds to latency. A request from Sydney to a server in us-east-1 carries roughly 180–200ms of round-trip network latency before a single line of your code runs. Most of that cannot be engineered away: the speed of light imposes a floor.
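A rough sketch of where that floor comes from. The distance (about 15,700 km great-circle from Sydney to Northern Virginia) and the fibre slowdown (light travelling at roughly two-thirds of its vacuum speed) are approximations chosen for illustration, not measurements of any real route:

```python
# Assumed figures, for illustration only.
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in a vacuum
FIBRE_FACTOR = 2 / 3               # assumption: light in glass fibre is about a third slower
DISTANCE_KM = 15_700               # assumption: approximate Sydney to Northern Virginia distance


def min_round_trip_ms(distance_km: float, fibre_factor: float = FIBRE_FACTOR) -> float:
    """Lower bound on round-trip time over a straight fibre run of the given length."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * fibre_factor)
    return 2 * one_way_s * 1000


floor_ms = min_round_trip_ms(DISTANCE_KM)
print(f"Physical floor: about {floor_ms:.0f} ms round trip")
```

Real cables do not follow the great circle and every router adds its own delay, so measured round trips sit above this bound, which is consistent with the 180–200ms figure.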

Latency percentiles matter more than averages. Average latency is misleading because a small number of very slow requests can coexist with mostly fast ones. The industry standard is to report p50 (median), p95, and p99:

Percentile | What it means
p50 | Half of requests are faster than this
p95 | 95% of requests are faster than this
p99 | 99% of requests are faster than this

A system with p50 latency of 20ms and p99 latency of 4,000ms feels fast for most users and catastrophically slow for one in a hundred. The p99 is the number that reveals that experience, and an alert that watches only the average or the median can miss it entirely.
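A minimal sketch of how the mean hides the tail. The sample data is hypothetical, and the nearest-rank method used here is only one of several common percentile definitions; monitoring tools may interpolate differently:

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 98 fast requests and 2 very slow ones.
latencies_ms = [20.0] * 98 + [4000.0] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))

print(f"mean={mean_ms:.1f}ms p50={p50:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")
```

With this data the mean lands far from both the typical request and the slow one, while p50 and p99 together describe both groups of users.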

The tail latency problem. In distributed systems, a single user request often fans out to dozens of downstream calls. If each downstream service is slow (say, 2 seconds) for 1% of requests, a request that touches ten services has roughly a 10% chance of hitting at least one slow response (1 − 0.99^10 ≈ 9.6%, assuming the services are slow independently of each other). At scale, tail latency dominates the user experience.
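The arithmetic behind that figure can be sketched as follows, assuming each downstream call is slow independently of the others (real services often share causes of slowness, so treat this as an approximation):

```python
def chance_of_slow_response(per_call_slow_rate: float, fan_out: int) -> float:
    """Probability that at least one of fan_out independent calls is slow."""
    return 1 - (1 - per_call_slow_rate) ** fan_out


for fan_out in (1, 10, 50, 100):
    chance = chance_of_slow_response(0.01, fan_out)
    print(f"{fan_out:>3} downstream calls -> {chance:.1%} of requests hit a slow call")
```

The same 1% per-service tail that looks harmless for one call affects a large share of requests once the fan-out reaches dozens of calls.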

Throughput

Throughput is the rate at which a system successfully processes requests — measured in requests per second (RPS), transactions per second (TPS), or bytes per second depending on context.

Throughput is a system-level property, not a per-request one. A single server that handles one request at a time, at 10ms each, has a theoretical maximum throughput of 100 RPS. But under real load (connection overhead, garbage collection pauses, lock contention, database connection pool limits) actual throughput is typically lower than the theoretical ceiling.
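The 100 RPS figure comes from dividing the number of requests in flight by the time each one takes. This is Little's Law rearranged, a name the post itself does not use. A small sketch, where the `concurrency` parameter is a hypothetical addition to show how parallel workers change the ceiling:

```python
def max_throughput_rps(service_time_ms: float, concurrency: int = 1) -> float:
    """Upper bound on requests per second when `concurrency` requests are processed at once."""
    if service_time_ms <= 0 or concurrency < 1:
        raise ValueError("service_time_ms must be positive and concurrency at least 1")
    return concurrency * 1000 / service_time_ms


single_worker = max_throughput_rps(10)      # one request at a time, 10ms each
eight_workers = max_throughput_rps(10, 8)   # hypothetical: eight workers, same per-request time

print(f"1 worker:  {single_worker:.0f} RPS ceiling")
print(f"8 workers: {eight_workers:.0f} RPS ceiling")
```

This is a ceiling, not a forecast: it ignores every source of contention listed above.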

The relationship between latency and throughput under load:

At low traffic, latency is stable and throughput scales linearly with load. As you approach system capacity, a characteristic curve emerges:

Low load:    Latency flat, throughput scales linearly
Near limit:  Latency starts climbing, throughput plateaus  
At limit:    Latency spikes, throughput drops (queueing effects)
Over limit:  System saturates, latency unbounded, throughput collapses
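One way to see why latency climbs so sharply near the limit is the textbook M/M/1 queueing model (random arrivals, one server, random service times). That model is an assumption brought in here for illustration; real systems only approximate it, and it captures the latency curve, not the throughput collapse:

```python
def mean_latency_ms(arrival_rps: float, capacity_rps: float) -> float:
    """Mean time in system for an M/M/1 queue: 1 / (service rate - arrival rate)."""
    if arrival_rps >= capacity_rps:
        return float("inf")  # the queue grows without bound
    return 1000 / (capacity_rps - arrival_rps)


CAPACITY_RPS = 100.0  # hypothetical server averaging 10ms per request

for load in (10, 50, 80, 90, 95, 99, 100):
    latency = mean_latency_ms(load, CAPACITY_RPS)
    print(f"{load:>3} RPS offered -> {latency:.1f} ms mean latency")
```

Going from 50% to 90% utilisation multiplies mean latency by five in this model, and the last few percent before capacity cost far more than that.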

This is why load testing to destruction is important. The system that handles 500 RPS gracefully might fall apart at 600 — not degrade gently, but collapse.

Throughput vs latency is often a genuine tradeoff. Batching is the clearest example: processing 1,000 database writes as a single batch operation dramatically improves throughput (fewer round trips, better use of I/O) but increases the latency for any individual write that has to wait for the batch to fill. Systems that need both low latency and high throughput often run separate paths: a synchronous path for latency-sensitive operations and a batch path for throughput-sensitive ones.
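A back-of-the-envelope sketch of the batching tradeoff. All numbers are hypothetical (500 writes arriving per second, a 5ms round trip per database call, 0.05ms of work per row), and arrivals are assumed to be evenly spaced:

```python
def batch_tradeoff(
    batch_size: int,
    arrival_rps: float,
    round_trip_ms: float,
    per_item_ms: float,
) -> tuple[float, float]:
    """Return (max writes per second, average wait in ms for the batch to fill) for one writer."""
    batch_cost_ms = round_trip_ms + batch_size * per_item_ms
    max_writes_per_s = batch_size * 1000 / batch_cost_ms
    # With evenly spaced arrivals, the first write waits for the other
    # batch_size - 1 to arrive and the last waits for none.
    avg_fill_wait_ms = (batch_size - 1) / 2 * 1000 / arrival_rps
    return max_writes_per_s, avg_fill_wait_ms


for size in (1, 10, 100, 1000):
    writes_per_s, wait_ms = batch_tradeoff(size, arrival_rps=500, round_trip_ms=5, per_item_ms=0.05)
    print(f"batch of {size:>4}: up to {writes_per_s:>8.0f} writes/s, {wait_ms:>6.1f} ms average wait to fill")
```

Larger batches amortise the round trip across more rows, which raises the write ceiling, while each write spends longer waiting for its batch to fill.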

Bandwidth

Bandwidth is the theoretical maximum data transfer rate of a network link, measured in bits per second (Mbps, Gbps). It's a property of the infrastructure — your ISP connection, your data centre uplink, the link between two services.

The key distinction: bandwidth is capacity, throughput is what actually gets through. A 1Gbps link that's transferring 400Mbps of data is running at 40% bandwidth utilisation. Throughput (actual transfer rate) is always less than or equal to bandwidth (maximum possible rate). The gap between the two is caused by protocol overhead, congestion, retransmissions, and contention from multiple streams sharing the link.

Bandwidth becomes the constraint when:

  • You're transferring large objects (images, video, large API payloads)

  • Multiple high-volume services share a network segment

  • Your CDN or data centre uplink is undersized for peak traffic

  • You're doing cross-region replication of large datasets

Bandwidth is often invisible until it isn't. A service handling small JSON payloads (around 1KB each) at 10,000 RPS uses roughly 10MB/s, which is trivial. The same service returning average payloads of 1MB each at 10,000 RPS needs 10GB/s (80Gbps), a very different infrastructure conversation.
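The two figures can be reproduced with a few lines, assuming a "small" payload means about 1KB and using decimal units (1MB = 1,000,000 bytes). Headers and protocol overhead are ignored:

```python
def required_bandwidth(rps: float, payload_bytes: float) -> tuple[float, float]:
    """Return (bytes per second, bits per second) needed to carry the payloads alone."""
    bytes_per_s = rps * payload_bytes
    return bytes_per_s, bytes_per_s * 8


for label, payload_bytes in (("1KB JSON", 1_000), ("1MB payload", 1_000_000)):
    bytes_per_s, bits_per_s = required_bandwidth(10_000, payload_bytes)
    print(f"{label}: {bytes_per_s / 1e6:,.0f} MB/s ({bits_per_s / 1e9:.2f} Gbps)")
```

Request rate stays the same in both cases; only the payload size moves the service from a fraction of a 1Gbps link to many 10Gbps links.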


How they interact

The three numbers are connected by a relationship that's easy to state and easy to forget:

Throughput ≤ Bandwidth / Average response size

A 1Gbps link (125MB/s) carrying responses that average 25KB each has a theoretical throughput ceiling of 5,000 RPS from bandwidth alone — regardless of how fast your servers process requests.
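The same ceiling as a function. The `usable_fraction` parameter is a hypothetical knob, not something the post defines; it stands in for protocol overhead and headroom, and its default of 1.0 reproduces the figure above:

```python
def bandwidth_ceiling_rps(
    link_bits_per_s: float,
    avg_response_bytes: float,
    usable_fraction: float = 1.0,
) -> float:
    """Maximum responses per second the link can carry, whatever the servers can do."""
    if avg_response_bytes <= 0 or not 0 < usable_fraction <= 1:
        raise ValueError("response size must be positive and usable_fraction in (0, 1]")
    usable_bytes_per_s = link_bits_per_s / 8 * usable_fraction
    return usable_bytes_per_s / avg_response_bytes


ideal = bandwidth_ceiling_rps(1_000_000_000, 25_000)
with_headroom = bandwidth_ceiling_rps(1_000_000_000, 25_000, usable_fraction=0.8)  # assumed 80% usable

print(f"Ideal ceiling:      {ideal:,.0f} RPS")
print(f"With 20% reserved:  {with_headroom:,.0f} RPS")
```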

The practical interactions:

Increasing throughput can increase latency. More concurrent requests means more contention for shared resources: CPU, database connections, locks. Each request waits longer. This is the fundamental tension in system capacity planning.

Reducing response size increases effective bandwidth. Compression, pagination, and response trimming (returning only requested fields) all reduce payload size — which stretches the same bandwidth further and reduces transmission latency.
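A small sketch of two of those techniques, response trimming and compression, using Python's standard `gzip` and `json` modules. The record shape is hypothetical, and how much is saved depends entirely on the data, so the script prints sizes rather than promising a ratio:

```python
import gzip
import json

# Hypothetical API response: 500 records with repetitive field names, as JSON usually has.
records = [
    {"id": i, "status": "active", "region": "ap-southeast-2", "score": i % 7}
    for i in range(500)
]

# Full response versus one trimmed to the two fields the client asked for.
full = json.dumps(records).encode("utf-8")
trimmed = json.dumps([{"id": r["id"], "score": r["score"]} for r in records]).encode("utf-8")

for label, body in (("full", full), ("trimmed", trimmed)):
    compressed = gzip.compress(body)
    print(f"{label:>7}: {len(body):>6} bytes raw, {len(compressed):>6} bytes gzipped")
```

Compression costs CPU time on both ends, so it pays off when the link, not the processor, is the constraint.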

Latency improvements don't always improve throughput. If your system is CPU-bound and you reduce database query time, each request finishes faster, but because the CPU is still the bottleneck, you haven't increased how many requests per second the system can handle.


Diagnosing which constraint you have

When a system "feels slow," the first step is identifying which number is the actual constraint:

It's a latency problem if: individual requests are slow regardless of load. One user making one request takes 2 seconds. Profiling a single request reveals a slow database query, a downstream service call with high latency, or unnecessary sequential processing that could be parallelised.

It's a throughput problem if: the system is fine under low load and degrades under high load. Individual requests are fast when traffic is light; everything slows when traffic spikes. The fix is usually horizontal scaling, connection pooling, caching, or queueing.

It's a bandwidth problem if: large payloads are slow but small ones are fast. The degradation correlates with response size, not request complexity. Network utilisation metrics confirm the link is near saturation.

Single request slow at any load?     → Latency problem
Fast at low load, slow under traffic? → Throughput problem  
Slow for large payloads specifically? → Bandwidth problem
All three at once?                    → You have a fun week ahead
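The checklist can be written down as a function. Everything here is hypothetical: the `Measurements` fields, the 200ms target and the 80% utilisation threshold are illustrative assumptions, not recommendations:

```python
from dataclasses import dataclass


@dataclass
class Measurements:
    """Hypothetical inputs gathered from a profiler, a load test and link monitoring."""
    p99_ms_low_load: float       # p99 latency for small responses under light traffic
    p99_ms_peak_load: float      # p99 latency for small responses at peak traffic
    p99_ms_large_payload: float  # p99 latency for large responses under light traffic
    link_utilisation: float      # 0.0 to 1.0, share of link capacity in use at peak


def diagnose(m: Measurements, target_ms: float = 200.0) -> list[str]:
    """Apply the checklist; the thresholds are illustrative assumptions."""
    findings = []
    if m.p99_ms_low_load > target_ms:
        findings.append("latency")      # slow even with no contention
    if m.p99_ms_low_load <= target_ms < m.p99_ms_peak_load:
        findings.append("throughput")   # fast when idle, slow under traffic
    if m.p99_ms_large_payload > target_ms >= m.p99_ms_low_load and m.link_utilisation > 0.8:
        findings.append("bandwidth")    # only large payloads suffer, and the link is nearly full
    return findings or ["no constraint found at this target"]


# The situation from the opening story: responses are fine in isolation, the pipe is saturated.
print(diagnose(Measurements(40, 45, 900, 0.95)))
```

The value of writing it down is less the code than the inputs it demands: you cannot call it without having measured latency at two load levels and checked link utilisation.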

The tradeoffs

Latency vs throughput: batching, buffering, and queueing all trade higher per-item latency for better system-level throughput. Right for background processing; wrong for user-facing requests.

Throughput vs durability: the fastest way to increase write throughput is to write asynchronously and acknowledge before data is fully persisted. This creates a window where acknowledged data could be lost if the node fails. Whether that tradeoff is acceptable depends entirely on what you're writing.

Bandwidth vs cost: bandwidth has a price. CDNs, egress fees, cross-region data transfer — optimising for bandwidth efficiency (compression, caching, smarter payload design) directly affects infrastructure cost at scale.


The one thing to remember

Average latency lies. Throughput ceilings are invisible until crossed. Bandwidth limits are silent until saturated. Measure all three, report latency at p95 and p99, load-test to find throughput ceilings before production finds them for you, and monitor bandwidth utilisation before a traffic spike turns your architecture diagram into a post-mortem.


← Previous: Reliability — whether your system responds correctly

→ Next: ACID vs BASE — now that you know how fast data moves, the next question is what guarantees your database makes about that data when things go wrong.

