MapReduce: Processing Petabytes in Parallel

Series: System Design · Architecture Patterns — Pillar 7 of 8
Systems Design
| # | Post | What it covers |
|---|---|---|
| 00 | Architecture Patterns: How Systems Are Structured | Twenty patterns covering monoliths, microservices, events, resilience, deployment, and data processing. How to structure systems that survive growth. |
| 01 | Monolithic Architecture: The Default That Gets Abandoned Too Early | Monoliths are fast to build and easy to operate. Learn when they're the right choice, when they break down, and how to know the difference. |
| 02 | Microservices: The Architecture You Earn, Not Choose | Microservices enable independent scaling and team autonomy — but at significant cost. Learn what you actually get, what you pay, and when it's worth it. |
| 03 | Serverless: Pay for What You Use, Not What You Provision | Serverless scales to zero and charges per invocation. Learn where it shines, where it fails, and how to design around cold starts and vendor lock-in. |
| 04 | Event-Driven Architecture: Decoupling Through Events | Event-driven systems communicate via events rather than direct calls. Learn how producers, consumers, and event brokers work — and the consistency tradeoffs involved. |
| 05 | Message Queues: Decoupling Produce from Consume | Message queues decouple producers and consumers, enable load levelling, and provide durability. Learn how they work and when to use Kafka vs SQS vs RabbitMQ. |
| 06 | Pub/Sub: Broadcasting Events to Multiple Consumers | Pub/sub decouples publishers from subscribers through topics. Learn how it differs from message queues and when to use Kafka, SNS, or Google Pub/Sub. |
| 07 | CQRS: When Reads and Writes Need Different Models | CQRS separates writes from reads so each can be optimised independently. Learn how it works, when it's worth the complexity, and when it isn't. |
| 08 | Event Sourcing: The Ledger, Not the Balance | Event sourcing stores state as a sequence of events. Learn how it works, what you get (audit log, time travel), and what it costs (complexity, schema evolution). |
| 09 | The Saga Pattern: Distributed Transactions Without Locks | The Saga pattern manages distributed transactions across services using compensating transactions. Learn choreography vs orchestration and when to use each. |
| 10 | The Outbox Pattern: Atomic Writes and Event Publishing | The Outbox pattern solves the dual-write problem — publishing an event and writing to a database atomically. Learn how it works using CDC or polling. |
| 11 | The Circuit Breaker: Stopping Cascading Failures | Circuit breakers prevent cascading failures by fast-failing calls to unhealthy dependencies. Learn the three states, how to configure them, and where to apply them. |
| 12 | The Bulkhead Pattern: Containing Failures Through Resource Isolation | Bulkheads isolate thread pools and connections per dependency so one failure can't exhaust resources needed by others. Learn how to apply them in practice. |
| 13 | The Sidecar Pattern: Cross-Cutting Concerns Without Code Changes | The sidecar pattern deploys a helper process alongside each service for logging, metrics, TLS, and service discovery — without modifying the service itself. |
| 14 | Service Mesh: A Programmable Network for Microservices | A service mesh handles service-to-service traffic, mTLS, circuit breaking, and observability via a fleet of sidecar proxies. Learn how it works and when to use it. |
| 15 | Service Discovery: Finding Services in a Dynamic Environment | Service discovery lets services find each other in dynamic environments. Learn client-side vs server-side discovery, health checks, and DNS vs registry approaches. |
| 16 | The Strangler Fig: Replacing a Legacy System Without Burning It Down | The Strangler Fig replaces a legacy system incrementally by routing specific functionality to new implementations while the old system keeps running. |
| 17 | Backend for Frontend: One API Per Client Type | BFF creates dedicated API backends per client type. Learn why one general API struggles to serve mobile and web well, and how BFF solves it. |
| 18 | ETL Pipelines: Moving Data from Operations to Analytics | ETL moves data from operational systems into analytical stores. Learn how pipelines work, what ELT is, and how to design reliable data movement at scale. |
| 19 | Batch vs Stream Processing: How Fresh Do Your Answers Need to Be? | Batch processes accumulate data then processes in bulk; streaming processes each event as it arrives. Learn the tradeoffs and when each is right. |
| 20 | MapReduce: Processing Petabytes in Parallel ← you are here | MapReduce processes massive datasets in parallel by splitting work into map and reduce phases. Learn how it works and why Spark has largely replaced it. |
| 21 | Architecture Patterns: Wrap-Up | A recap of all 20 architecture patterns across decomposition, async communication, data patterns, resilience, and data processing. How they connect. |
MapReduce: Processing Petabytes in Parallel
The problem
You want to compute the total click count for every country code in the URL shortener's history — across three years of click data, 500 billion rows, 40 terabytes. Running this as a single SQL query against Cassandra would take days (if it didn't time out first). The data doesn't fit in a single machine's memory. A single-threaded scan would take weeks.
The computation itself is trivial: for each click event, read the country code; count how many times each country code appears. But at this scale, the challenge isn't the logic — it's the parallelisation.
If the 40TB dataset were split across 200 machines, and each machine processed its 200GB slice independently, the problem finishes 200x faster. Then the partial counts from each machine are combined into a final total. This is MapReduce.
The core idea
MapReduce is a programming model for processing large datasets in parallel across a distributed cluster. The computation is split into two phases: a Map phase (each machine processes its slice of data independently, emitting key-value pairs) and a Reduce phase (the emitted pairs are grouped by key and aggregated). The framework handles data distribution, parallelisation, fault tolerance, and the "shuffle" step between map and reduce.
The analogy: counting votes by district then nationally
A national election counts millions of votes. The approach:
Map phase (district counting): each district (machine) counts its own ballot papers, producing a tally per candidate per district.
Shuffle (centralise by candidate): all tallies for Candidate A go to one counter, all tallies for Candidate B go to another. The framework handles this redistribution.
Reduce phase (national totals): each counter sums their candidate's tallies from all districts, producing a final national total.
Each phase is embarrassingly parallel within itself. The shuffle is the expensive coordination step. The final result is assembled from partial results.
How MapReduce works
The two phases
Map function: takes an input record, applies a transformation, and emits zero or more (key, value) pairs.
def map_function(record):
# Input: one click event row
# Output: (country_code, 1) for each click
country = record["country_code"]
yield (country, 1)
Reduce function: receives all values for a given key and aggregates them into a single result.
def reduce_function(country_code, values):
# Input: country_code = "AU", values = [1, 1, 1, 1, ...]
# Output: ("AU", total_count)
yield (country_code, sum(values))
The full execution
40 TB dataset split into 200 chunks of 200 GB
↓
Map phase (parallel, 200 machines):
Machine 1: reads its 200 GB → emits (AU, 1), (US, 1), (GB, 1), ...
Machine 2: reads its 200 GB → emits (AU, 1), (IN, 1), (US, 1), ...
...
Machine 200: reads its 200 GB → emits ...
Shuffle (framework-managed):
All (AU, 1) pairs → Reducer for "AU"
All (US, 1) pairs → Reducer for "US"
(This redistribution phase writes to disk — expensive)
Reduce phase (parallel, one reducer per country):
Reducer for "AU": sum([1, 1, 1, ...]) → ("AU", 8,432,091,234)
Reducer for "US": sum([1, 1, 1, ...]) → ("US", 31,204,983,221)
...
Each phase runs in parallel. The bottleneck is the shuffle — all mappers must complete before reducers can start, and the shuffle writes massive amounts of data to disk.
Fault tolerance
A key feature of MapReduce is fault tolerance. If a machine dies mid-computation, the framework re-runs that machine's tasks on a different machine. The computation is idempotent — rerunning a map task on the same input produces the same output. This makes MapReduce reliable on commodity hardware with a high failure rate.
Hadoop MapReduce vs Spark
Hadoop MapReduce writes intermediate results (the shuffle output) to HDFS between map and reduce phases. For multi-step jobs (map → reduce → reduce), this means multiple expensive disk writes and reads. A three-step MapReduce job writes to disk six times.
Apache Spark keeps intermediate results in memory where possible, sharing them across stages. A three-step Spark job may write to disk only once (for fault tolerance checkpoints). Spark is typically 10–100x faster than Hadoop MapReduce for iterative and multi-stage computations.
Spark also supports more general computation patterns (not just map-and-reduce), SQL queries (Spark SQL), streaming (Structured Streaming), machine learning (MLlib), and graph processing (GraphX) — all within the same API.
Hadoop MapReduce is largely legacy. New systems are built on Spark or managed cloud services (Google Dataflow, AWS Glue, Databricks). The MapReduce programming model remains important to understand — Spark's DAG execution model is conceptually its descendant — but you're unlikely to write Hadoop MapReduce jobs for new work.
When MapReduce (or Spark) is the answer
MapReduce / Spark is the right tool when:
- The dataset is too large to fit in one machine's memory
- The computation can be parallelised across data partitions
- Results don't need to be real-time (batch processing)
Examples:
- Aggregate click counts across years of history
- Training a machine learning model on billions of examples
- Computing graph metrics across a social network with billions of edges
- Re-running a corrected analytics pipeline over historical data
Not the right tool:
- Low-latency queries (use a pre-built index, a data warehouse, or Cassandra)
- Real-time stream processing (use Flink, Kafka Streams)
- Small datasets that fit in one machine's memory (just use SQL or Python pandas)
The one thing to remember
MapReduce enables parallel processing of massive datasets by splitting computation into two stages: Map (process each data partition independently, emit key-value pairs) and Reduce (aggregate all values for each key). The framework handles distribution, parallel execution, the shuffle (grouping by key), and fault recovery. Spark has superseded Hadoop MapReduce in practice by keeping intermediate data in memory — 10–100x faster for multi-stage jobs. Understanding the model (split, transform, shuffle, aggregate) is fundamental to reasoning about any large-scale batch data processing system.
← Previous: Batch vs Stream Processing — ETL introduced the concept of batch vs streaming data movement; this post covers the architectural choice in depth.
→ Next: Architecture Patterns — Wrap-up — twenty concepts tied together, end-to-end architecture of the URL shortener, and bridge to the final pillar.



