Service Discovery: Finding Services in a Dynamic Environment

Series: System Design · Architecture Patterns — Pillar 7 of 8

Systems Design

| # | Post | What it covers |
|---|---|---|
| 00 | Architecture Patterns: How Systems Are Structured | Twenty patterns covering monoliths, microservices, events, resilience, deployment, and data processing. How to structure systems that survive growth. |
| 01 | Monolithic Architecture: The Default That Gets Abandoned Too Early | Monoliths are fast to build and easy to operate. Learn when they're the right choice, when they break down, and how to know the difference. |
| 02 | Microservices: The Architecture You Earn, Not Choose | Microservices enable independent scaling and team autonomy, but at significant cost. Learn what you actually get, what you pay, and when it's worth it. |
| 03 | Serverless: Pay for What You Use, Not What You Provision | Serverless scales to zero and charges per invocation. Learn where it shines, where it fails, and how to design around cold starts and vendor lock-in. |
| 04 | Event-Driven Architecture: Decoupling Through Events | Event-driven systems communicate via events rather than direct calls. Learn how producers, consumers, and event brokers work, and the consistency tradeoffs involved. |
| 05 | Message Queues: Decoupling Produce from Consume | Message queues decouple producers and consumers, enable load levelling, and provide durability. Learn how they work and when to use Kafka vs SQS vs RabbitMQ. |
| 06 | Pub/Sub: Broadcasting Events to Multiple Consumers | Pub/sub decouples publishers from subscribers through topics. Learn how it differs from message queues and when to use Kafka, SNS, or Google Pub/Sub. |
| 07 | CQRS: When Reads and Writes Need Different Models | CQRS separates writes from reads so each can be optimised independently. Learn how it works, when it's worth the complexity, and when it isn't. |
| 08 | Event Sourcing: The Ledger, Not the Balance | Event sourcing stores state as a sequence of events. Learn how it works, what you get (audit log, time travel), and what it costs (complexity, schema evolution). |
| 09 | The Saga Pattern: Distributed Transactions Without Locks | The Saga pattern manages distributed transactions across services using compensating transactions. Learn choreography vs orchestration and when to use each. |
| 10 | The Outbox Pattern: Atomic Writes and Event Publishing | The Outbox pattern solves the dual-write problem: publishing an event and writing to a database atomically. Learn how it works using CDC or polling. |
| 11 | The Circuit Breaker: Stopping Cascading Failures | Circuit breakers prevent cascading failures by fast-failing calls to unhealthy dependencies. Learn the three states, how to configure them, and where to apply them. |
| 12 | The Bulkhead Pattern: Containing Failures Through Resource Isolation | Bulkheads isolate thread pools and connections per dependency so one failure can't exhaust resources needed by others. Learn how to apply them in practice. |
| 13 | The Sidecar Pattern: Cross-Cutting Concerns Without Code Changes | The sidecar pattern deploys a helper process alongside each service for logging, metrics, TLS, and service discovery, without modifying the service itself. |
| 14 | Service Mesh: A Programmable Network for Microservices | A service mesh handles service-to-service traffic, mTLS, circuit breaking, and observability via a fleet of sidecar proxies. Learn how it works and when to use it. |
| 15 | Service Discovery: Finding Services in a Dynamic Environment ← you are here | Service discovery lets services find each other in dynamic environments. Learn client-side vs server-side discovery, health checks, and DNS vs registry approaches. |
| 16 | The Strangler Fig: Replacing a Legacy System Without Burning It Down | The Strangler Fig replaces a legacy system incrementally by routing specific functionality to new implementations while the old system keeps running. |
| 17 | Backend for Frontend: One API Per Client Type | BFF creates dedicated API backends per client type. Learn why one general API struggles to serve mobile and web well, and how BFF solves it. |
| 18 | ETL Pipelines: Moving Data from Operations to Analytics | ETL moves data from operational systems into analytical stores. Learn how pipelines work, what ELT is, and how to design reliable data movement at scale. |
| 19 | Batch vs Stream Processing: How Fresh Do Your Answers Need to Be? | Batch processing accumulates data and then processes it in bulk; stream processing handles each event as it arrives. Learn the tradeoffs and when each is right. |
| 20 | MapReduce: Processing Petabytes in Parallel | MapReduce processes massive datasets in parallel by splitting work into map and reduce phases. Learn how it works and why Spark has largely replaced it. |
| 21 | Architecture Patterns: Wrap-Up | A recap of all 20 architecture patterns across decomposition, async communication, data patterns, resilience, and data processing. How they connect. |

The problem

In a static deployment, you configure service addresses in a config file: analytics-service=10.0.1.45:8080. This works until the analytics service pod restarts. In Kubernetes, the new pod gets a new IP address. Your hardcoded config is immediately stale. Every service that calls analytics must be reconfigured.

At scale, this problem compounds: hundreds of services, each with multiple instances, each instance potentially changing IP on every deployment or failure. Keeping that configuration current by hand is not feasible. You need a mechanism that automatically tracks which instances of each service are healthy and where they're running.


The core idea

Service discovery is the mechanism by which services dynamically locate the network addresses of other services they need to call. In a dynamic environment (containers, Kubernetes, auto-scaling), service instances start, stop, and change addresses constantly. A service registry tracks the current healthy instances of each service; clients query the registry to find where to send requests.


The analogy: a phone directory with real-time updates

A static config file is a printed phone directory — accurate on print day, stale immediately. Service discovery is an online directory that's updated in real time as numbers change, with a filter that hides disconnected numbers.


How service discovery works

Service registration

When a service instance starts, it registers itself with the service registry: "I am analytics-service, I'm at 10.0.2.73:8080, and I'm healthy."

The registry stores: service name → list of (address, port, metadata) for healthy instances.

When an instance shuts down gracefully, it deregisters itself. When it crashes and cannot deregister, the registry notices the failed health checks and removes it.
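The register, deregister, and lookup steps can be sketched as a tiny in-memory registry. This is an illustration only, not how any particular registry is implemented: the `Instance` and `ServiceRegistry` names and the second address (10.0.2.74) are hypothetical, and a real registry would also persist and replicate this state.

```python
from typing import NamedTuple


class Instance(NamedTuple):
    """One running copy of a service (hypothetical record shape)."""
    host: str
    port: int


class ServiceRegistry:
    """In-memory map: service name -> set of registered instances."""

    def __init__(self) -> None:
        self._services: dict[str, set[Instance]] = {}

    def register(self, name: str, instance: Instance) -> None:
        # Called by an instance when it starts.
        self._services.setdefault(name, set()).add(instance)

    def deregister(self, name: str, instance: Instance) -> None:
        # Called on graceful shutdown, or by the registry after failed checks.
        self._services.get(name, set()).discard(instance)

    def lookup(self, name: str) -> list[Instance]:
        # Sorted so callers see a stable order.
        return sorted(self._services.get(name, set()))


registry = ServiceRegistry()
registry.register("analytics-service", Instance("10.0.2.73", 8080))
registry.register("analytics-service", Instance("10.0.2.74", 8080))  # hypothetical second instance
registry.deregister("analytics-service", Instance("10.0.2.73", 8080))
print(registry.lookup("analytics-service"))
```

Using a set per service makes repeated registration of the same address harmless, which matters when instances retry registration on startup.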

Health checks

The registry periodically probes each registered instance (or, in heartbeat-based registries, expects each instance to check in on a schedule) to verify it's still healthy. An instance that fails health checks is removed from the registry and no longer receives traffic.
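One common way to avoid evicting an instance on a single dropped probe is to count consecutive failures. The sketch below assumes that approach; the `HealthChecker` class, the threshold of 3, and the injected `probe` callable are all hypothetical. In practice the probe would be something like an HTTP request to a health endpoint.

```python
from typing import Callable

Address = tuple[str, int]


class HealthChecker:
    """Tracks consecutive probe failures and evicts unhealthy instances."""

    def __init__(self, probe: Callable[[Address], bool], max_failures: int = 3):
        self.probe = probe                  # returns True if the instance answered
        self.max_failures = max_failures    # hypothetical eviction threshold
        self.failures: dict[Address, int] = {}

    def add(self, address: Address) -> None:
        self.failures[address] = 0

    def run_once(self) -> list[Address]:
        """Probe every instance once; return the addresses evicted this round."""
        evicted = []
        for address in list(self.failures):
            if self.probe(address):
                self.failures[address] = 0  # any success resets the counter
            else:
                self.failures[address] += 1
                if self.failures[address] >= self.max_failures:
                    del self.failures[address]
                    evicted.append(address)
        return evicted

    def registered(self) -> list[Address]:
        return sorted(self.failures)


down = {("10.0.2.73", 8080)}  # pretend this instance has stopped answering
checker = HealthChecker(probe=lambda addr: addr not in down)
checker.add(("10.0.2.73", 8080))
checker.add(("10.0.2.74", 8080))
for _ in range(3):
    evicted = checker.run_once()
print(evicted, checker.registered())
```

A higher threshold tolerates transient blips but keeps a dead instance in rotation for longer, so the probe interval and threshold together set how quickly failures are noticed.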

Client-side discovery

The client queries the registry to get a list of healthy instances, then picks one (using round robin, random, or load-aware selection) and connects directly.

Used by: Netflix Eureka and Consul, typically paired with a client-side load balancer such as Ribbon.
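A minimal sketch of the client-side flow, assuming round-robin selection: the client asks the registry for the current list on every call and rotates through it. `RoundRobinClient` and the `lookup` callable are hypothetical stand-ins for a real registry client library.

```python
import itertools
from typing import Callable

Address = tuple[str, int]


class RoundRobinClient:
    """Queries a registry lookup function and rotates through the results."""

    def __init__(self, lookup: Callable[[str], list[Address]]):
        self.lookup = lookup
        self._counters: dict[str, itertools.count] = {}

    def pick(self, service: str) -> Address:
        instances = self.lookup(service)  # fresh list of healthy instances
        if not instances:
            raise LookupError(f"no healthy instances for {service}")
        n = next(self._counters.setdefault(service, itertools.count()))
        return instances[n % len(instances)]


# Hypothetical registry contents for illustration.
instances = {"analytics-service": [("10.0.2.73", 8080), ("10.0.2.74", 8080)]}
client = RoundRobinClient(lambda name: instances.get(name, []))
picks = [client.pick("analytics-service") for _ in range(3)]
print(picks)
```

Because the list is fetched on each call, an instance removed from the registry stops receiving traffic on the next request; a real client would usually cache the list briefly to avoid a registry round trip per call.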

Server-side discovery

The client sends requests to a load balancer or service proxy. The proxy queries the registry and forwards to a healthy instance. The client doesn't know the registry exists.

Used by: Kubernetes Services (kube-proxy), AWS ALB with ECS service discovery, Envoy with Consul.

DNS-based discovery

The simplest form of server-side discovery: the service registry updates DNS entries. Clients resolve the service name via DNS; the DNS server returns only healthy instance IPs.

Kubernetes uses DNS natively: analytics-service.default.svc.cluster.local resolves to the ClusterIP of the analytics-service Service. kube-proxy maintains iptables rules (or IPVS rules, depending on its mode) that load-balance across the ready pod IPs behind the ClusterIP.

This is the default in Kubernetes and requires zero application code.
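From the application's point of view, DNS-based discovery is an ordinary name lookup. The helper below is a small sketch using Python's standard `socket.getaddrinfo`; the `resolve` function name is hypothetical. For a ClusterIP Service you would expect a single address back, while a DNS setup that returns instance IPs directly would yield several.

```python
import socket


def resolve(hostname: str, port: int) -> list[str]:
    """Return the distinct IP addresses the system resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


# Inside a cluster, the name from the text would be used:
# resolve("analytics-service.default.svc.cluster.local", 8080)
```

Most HTTP clients do this lookup implicitly when given a URL such as http://analytics-service:8080, which is why no discovery code is needed in the application.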


Kubernetes native service discovery

In Kubernetes, service discovery is handled automatically:

Service object: a stable virtual IP (ClusterIP) that kube-proxy routes to healthy pods matching a label selector.

apiVersion: v1
kind: Service
metadata:
  name: analytics-service   # becomes the DNS name of the Service
spec:
  selector:
    app: analytics          # routes to pods with this label
  ports:
  - port: 8080              # Service port; targetPort defaults to the same value

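DNS: CoreDNS resolves analytics-service to the ClusterIP. Any pod in the same namespace can reach analytics at analytics-service:8080; pods in other namespaces use a namespace-qualified name such as analytics-service.default:8080. The address is stable even as pods come and go.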
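The name forms a Service can be reached by follow a fixed pattern. The helper below spells them out; `service_dns_names` is a hypothetical function, and `cluster.local` is assumed as the cluster domain, matching the example name used earlier.

```python
def service_dns_names(service: str, namespace: str = "default",
                      cluster_domain: str = "cluster.local") -> list[str]:
    """DNS names for a Service, shortest first."""
    return [
        service,                                        # works from the same namespace
        f"{service}.{namespace}",                       # works from any namespace
        f"{service}.{namespace}.svc.{cluster_domain}",  # fully qualified
    ]


print(service_dns_names("analytics-service"))
```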

Headless services: set clusterIP: None to skip the ClusterIP; DNS then returns the individual pod IPs directly. Used for stateful services (databases) where clients need to connect to specific instances.


Tradeoffs

Registry availability. If the service registry is unavailable, services can't discover new instances. Cached results help for a period but go stale. The service registry must therefore be highly available, typically as a replicated cluster (Consul servers replicate via Raft; Kubernetes keeps its state in etcd).
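The "cached results help for a period" behaviour can be sketched as a resolver that falls back to the last known answer when the registry call fails. Everything here is illustrative: `CachingResolver` is a hypothetical name, and registry outages are modelled as `ConnectionError` raised by the lookup function.

```python
from typing import Callable

Address = tuple[str, int]


class CachingResolver:
    """Serves the last known instance list when the registry is unreachable."""

    def __init__(self, lookup: Callable[[str], list[Address]]):
        self.lookup = lookup
        self.cache: dict[str, list[Address]] = {}

    def resolve(self, service: str) -> list[Address]:
        try:
            self.cache[service] = self.lookup(service)
        except ConnectionError:
            # Registry unreachable: reuse the previous answer if there is one.
            if service not in self.cache:
                raise
        return self.cache[service]


calls = {"n": 0}


def flaky_lookup(service: str) -> list[Address]:
    # Succeeds once, then simulates a registry outage.
    calls["n"] += 1
    if calls["n"] > 1:
        raise ConnectionError("registry unavailable")
    return [("10.0.2.73", 8080)]


resolver = CachingResolver(flaky_lookup)
first = resolver.resolve("analytics-service")
second = resolver.resolve("analytics-service")  # served from cache
print(first == second)
```

The cache keeps existing traffic flowing during an outage, but it cannot learn about new or failed instances, which is why it only buys time.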

DNS TTL caching. DNS-based discovery relies on clients respecting TTL. Clients that cache DNS responses for too long may route to stale (failed) instances. Short TTLs (5–30 seconds) reduce staleness but increase DNS query load.
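A back-of-the-envelope model makes the tradeoff concrete. It assumes each client re-resolves one service name once per TTL and that clients are evenly spread over time; the 1000-client figure is hypothetical. Under those assumptions the worst-case staleness is roughly one TTL, and query load scales inversely with it.

```python
def dns_queries_per_second(clients: int, ttl_seconds: float) -> float:
    """Steady-state lookups per second if every client re-resolves once per TTL."""
    return clients / ttl_seconds


for ttl in (5, 30, 300):
    rate = dns_queries_per_second(1000, ttl)
    print(f"TTL {ttl:>3}s: about {rate:.1f} queries/s, stale for up to {ttl}s")
```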

Client-side vs server-side tradeoffs. Client-side discovery couples each service to the registry client library — every service needs the library in its language. Server-side discovery (via a proxy or load balancer) keeps this out of application code but adds a hop.


The one thing to remember

Service discovery solves the problem of services finding each other in a dynamic environment where instance addresses change constantly. In Kubernetes, it's largely automatic: Service objects with DNS provide stable addresses, and kube-proxy handles load balancing to healthy pods behind the scenes. In non-Kubernetes environments, Consul or a similar registry provides registration, health checking, and a queryable directory. Don't hardcode service addresses — any static config you write will be wrong the moment the target service restarts.


← Previous: Service Mesh — when every service has a sidecar, you have a service mesh; here's how the control plane manages that fleet.

→ Next: Strangler Fig — the safe way to migrate a legacy system incrementally, without a risky big-bang rewrite.

Cloud Tuned

Your starting point for anything cloud: AWS, Azure, GCP, Serverless, Architecture, Hybrid Cloud, Systems Design and other Information Technology topics.