Skip to main content

Command Palette

Search for a command to run...

Distributed File Systems: File Storage Across Many Machines

Updated
11 min read
Distributed File Systems: File Storage Across Many Machines

Series: System Design · Data & Storage — Pillar 4 of 8

Systems Design

# Post What it covers
00 Data & Storage: Where Everything Lives Where data lives shapes everything about a system. Nineteen concepts covering databases, indexing, sharding, replication, and the data structures underneath. (161 chars)
01 SQL vs NoSQL: Choosing the Right Database SQL vs NoSQL isn't a simple choice. Learn what each type optimises for, when to use relational databases, and when NoSQL is the right call.
02 Database Indexing: The Highest-Leverage Performance Tool Indexes are the highest-leverage database performance tool. Learn how they work, what they cost, and how to decide when to add one.
03 B-Trees & B+ Trees: The Data Structure Behind Database Indexes Almost every database index is built on a B-tree or B+ tree. Learn how they work, why they're fast, and what this means for your queries.
04 LSM Trees: Why Some Databases Are Built for Writes LSM trees power Cassandra, RocksDB, and LevelDB. Learn how they achieve massive write throughput and what they trade off to get it.
05 Denormalisation: Trading Storage for Speed Denormalisation trades storage for read speed by pre-computing joins. Learn when it helps, when it hurts, and how to do it safely.
06 Database Sharding: Scaling Beyond a Single Node Sharding splits a database across multiple nodes. Learn how it works, the strategies available, and the significant tradeoffs it introduces.
07 Data Partitioning: Choosing How to Divide Your Data Range, hash, and list partitioning each make different tradeoffs. Learn how to divide data effectively for queries, maintenance, and scale.
08 Consistent Hashing: Minimising Resharding Pain Consistent hashing minimises data movement when nodes are added or removed. Learn how it works and why it's fundamental to distributed systems.
09 Replication & Read Replicas: Scaling Reads and Surviving Failures Replication copies data across nodes for fault tolerance and read scaling. Learn how primary-replica setups work and when to use them.
10 Object Storage: Unlimited Scale for Large Binary Data Object storage handles large binary files at unlimited scale. Learn how it works, why it replaced file servers, and when to use it.
11 Block vs File vs Object Storage: Three Models, Three Use Cases Three storage models, three different use cases. Learn what block, file, and object storage optimise for and how to choose between them.
12 Distributed File Systems: File Storage Across Many Machines ← you are here Distributed file systems spread file storage across many machines. Learn how HDFS, Ceph, and GlusterFS work and when to use them.
13 Time Series Databases: Built for Metrics and Events Time series databases handle append-heavy metric data far better than SQL. Learn how they work and when to use InfluxDB, Prometheus, or TimescaleDB.
14 Vector Databases: Semantic Search and AI Memory Vector databases power semantic search, recommendations, and LLM memory. Learn how embeddings work, what ANN search is, and when to use one.
15 Full-Text Search Engines: Beyond SQL LIKE Full-text search needs more than SQL LIKE. Learn how inverted indexes, relevance ranking, and Elasticsearch make text search fast and powerful.
16 Materialized Views: Pre-Computing Expensive Queries Materialized views cache expensive query results as physical tables. Learn how they work, when to refresh them, and when to use them vs other approaches.
17 Query Optimisation: From Slow to Fast Slow queries aren't always fixed by adding indexes. Learn how to read EXPLAIN output, understand query plans, and systematically make queries fast.
18 Connection Pooling: Managing the Hidden Bottleneck Opening a database connection per request doesn't scale. Learn how connection pooling works, what PgBouncer does, and how to size your pool correctly.
19 Data & Storage: Wrap-Up A recap of all 19 data storage concepts: SQL, NoSQL, indexing, sharding, replication, specialised databases, and how they connect in a real system.

Distributed File Systems: File Storage Across Many Machines

The problem

Your URL shortener now generates daily analytics exports — one compressed CSV per day per user, billions of rows. Over three years you've accumulated 40TB of exports. Your analytics pipeline (Apache Spark jobs) processes these files in parallel across a 20-node compute cluster: each node reads different date ranges, applies transformations, and writes output.

A single NFS server struggles under 20 simultaneous readers each reading 500MB/s. It becomes the bottleneck. A single node can't serve 20 × 500MB/s = 10GB/s of read throughput.

Object storage (S3) could hold the files, but Spark jobs that read from S3 must download each file over the network before processing it — compute and storage are separated, creating a shuffle bottleneck.

What you need is storage that scales horizontally alongside compute — where every compute node has fast local access to a portion of the data, the full dataset is spread across all nodes, and the system handles fault tolerance automatically.

This is what distributed file systems like HDFS were built for.


The core idea

A distributed file system stores files across multiple machines, presenting them as a single unified namespace. Files are split into blocks and distributed across storage nodes. Clients access any file through a single namespace, unaware of which nodes hold which blocks. Fault tolerance is achieved by replicating each block across multiple nodes.


The analogy: a library distributed across multiple buildings

A distributed file system is like a large university library where the books are split across multiple buildings by subject — engineering, medicine, law, science. A central catalogue knows which building holds which books. When you request a book, the catalogue tells you which building to visit; you go directly to that building.

For fault tolerance, each book is copied three times — one copy in the main building, two in separate satellite buildings. If one building burns down, the book is still available from the other copies.

Crucially, if you're in the engineering building, all engineering books are local. You don't walk to a different building for every page — you get large batches of related content from your local shelves. This is data locality: co-locating computation with the data it needs.


How it works

HDFS: the Hadoop Distributed File System

HDFS is the most widely understood distributed file system, designed as the storage layer for Apache Hadoop's MapReduce framework. Its architecture establishes patterns used by many successors.

NameNode (metadata server):

  • Stores the filesystem namespace: directory tree, file metadata, block-to-node mapping
  • Does not store actual file data — only metadata
  • Every client operation (open, rename, delete) goes through the NameNode
  • Critical bottleneck: if the NameNode is down, the entire filesystem is inaccessible (HDFS 2+ adds NameNode HA)

DataNodes (storage servers):

  • Store actual file data as blocks (default 128MB per block)
  • Report to the NameNode periodically (heartbeats + block reports)
  • Serve block reads and writes directly to clients after the NameNode provides block locations

Reading a file:

Client → NameNode: "Where are the blocks for /exports/2025-06-01.csv?"
NameNode → Client: "Block 1 is on DataNode A, B, C. Block 2 is on DataNode B, D, A."
Client → DataNode A: "Give me Block 1"
Client → DataNode B: "Give me Block 2"
(reads blocks in parallel from multiple DataNodes)

Large block size (128MB default): HDFS blocks are much larger than a typical filesystem (4KB). Large blocks amortise NameNode metadata overhead and reduce the number of distinct block locations the NameNode must track. They also encourage sequential reads — once a client starts reading a 128MB block, it reads it sequentially, which is fast on both spinning disks and SSDs.

Replication factor: each block is replicated across multiple DataNodes (default: 3). The NameNode tracks which DataNodes hold each replica. If a DataNode fails, the NameNode detects it (missed heartbeats), identifies under-replicated blocks, and instructs surviving DataNodes to re-replicate those blocks.

Data locality: when Spark (or another compute framework) submits a job, the scheduler tries to assign compute tasks to nodes that already hold the data. Reading from local disk is 10–100x faster than reading across the network. HDFS's block location information enables the scheduler to make this assignment.

Ceph

Ceph is a modern distributed storage platform that provides object, block, and file storage interfaces over the same underlying cluster — solving multiple storage problems with one system.

CRUSH algorithm: Ceph uses a deterministic placement algorithm (CRUSH — Controlled Replication Under Scalable Hashing) rather than a centralised metadata server. Given an object ID and the cluster's topology, CRUSH computes which nodes hold the object. Clients calculate the placement themselves — no central metadata bottleneck.

Three interfaces over one cluster:

  • RADOS (Reliable Autonomic Distributed Object Store): the underlying object store
  • RBD (RADOS Block Device): block storage backed by Ceph objects
  • CephFS: POSIX-compliant distributed file system using a metadata server (MDS) cluster
  • RADOSGW: S3/Swift-compatible object storage interface

Ceph is commonly used as the storage backend for OpenStack, Kubernetes persistent volumes, and on-premises cloud storage.

GlusterFS

GlusterFS is a POSIX-compliant distributed file system with no centralised metadata server. Instead, it uses an elastic hashing algorithm to determine which node holds each file. Files are distributed across storage "bricks" (directories on storage nodes) using configurable distribution strategies: distributed (files distributed across bricks), replicated (each file on N bricks), erasure-coded (file chunks across N bricks with recovery capability).

GlusterFS mounts as a regular filesystem — applications see a standard directory tree and use standard filesystem operations. It's commonly used for shared storage in virtualisation environments and for scale-out NAS.


When distributed file systems fit

Distributed file systems are the right choice when:

  • Compute-storage co-location matters. Spark and Hadoop workloads that process large files benefit enormously from data locality — reading from local storage rather than over the network.
  • Scale exceeds what single-node NFS can serve. Multiple compute nodes need simultaneous high-throughput access to the same dataset.
  • POSIX semantics are required. Applications that use open(), write(), seek() need a filesystem, not object storage.
  • On-premises infrastructure. Object storage (S3) is the dominant choice in cloud environments; distributed file systems are more relevant when building your own infrastructure.

Distributed file systems are NOT the right choice when:

  • You're on a public cloud. S3 + Spark's native S3 connector (with EMRFS or Delta Lake) has nearly eliminated the need for HDFS in cloud environments. Managing HDFS clusters on AWS EC2 is operationally heavy; S3 is nearly free to operate.
  • You need random read/write access at high IOPS. Distributed file systems optimise for sequential large-file reads. They're poor at the random I/O patterns that databases require.
  • You need object storage semantics. HTTP access, presigned URLs, lifecycle policies — use S3.
  • Simplicity matters. A distributed file system cluster is operationally complex. S3 is not.

The tradeoffs

HDFS: proven at massive scale (Facebook and Yahoo ran multi-petabyte HDFS clusters). But the NameNode is a central point of failure and a metadata bottleneck (HDFS HA improves this but adds complexity). Large block sizes mean small files are storage-inefficient — a 1KB file still occupies a 128MB block.

Ceph: flexible (one cluster for object, block, and file workloads) and has no central metadata server. More complex to operate and tune than HDFS or object storage. The CRUSH algorithm is powerful but requires careful planning when cluster topology changes.

GlusterFS: simpler to set up than Ceph. No central metadata server. Less proven at the largest scales. Performance can degrade with many small files.

All distributed file systems vs S3:

  • DFS: lower latency for large sequential reads, data locality for compute, POSIX semantics
  • S3: operationally simpler, infinitely scalable, costs nothing to operate, but compute-storage separation adds network I/O

The one thing to remember

Distributed file systems solve the problem of file storage that must scale beyond a single server while supporting POSIX semantics and data locality for compute workloads. In cloud environments, managed object storage (S3) has largely displaced HDFS for new workloads because the operational overhead of managing a distributed file system cluster rarely justifies the data locality benefit when compute-storage network speeds are fast. Reach for a distributed file system when you're running on-premises, need POSIX semantics, or your workload's performance critically depends on reading data from local storage rather than across the network.


← Previous: Block vs File vs Object Storage: Three Models, Three Use Cases — Three storage models, three different use cases. Learn what block, file, and object storage optimise for and how to c...

→ Next: Time Series Databases: Built for Metrics and Events — Time series databases handle append-heavy metric data far better than SQL. Learn how they work and when to use Influx...

Systems Design

Part 1 of 50

Understanding these system design concepts is essential for architects, developers, and engineers to create scalable, reliable, and maintainable software systems that meet the needs of businesses.

More from this blog

Cloud Tuned

729 posts

Your starting point for anything cloud: AWS, Azure, GCP, Serverless, Architecture, Hybrid Cloud, Systems Design and other Information Technology topics.